Bug 251414 - pf sometimes panics when reloading ruleset with tables
Summary: pf sometimes panics when reloading ruleset with tables
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 12.2-RELEASE
Hardware: Any Any
: --- Affects Only Me
Assignee: Mark Johnston
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2020-11-26 20:25 UTC by sigsys
Modified: 2022-10-12 00:49 UTC (History)
2 users (show)

See Also:


Attachments
pf.conf (3.84 KB, text/plain)
2020-11-26 21:07 UTC, sigsys
no flags Details
table (42.40 KB, text/plain)
2020-11-26 21:07 UTC, sigsys
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description sigsys 2020-11-26 20:25:46 UTC
Happens on 12.2-RELEASE when doing a `service pf reload` when pf.conf loads a table from a file with about 2700 entries.  Never had this problem before I added the table.


[1232428] Fatal trap 12: page fault while in kernel mode
[1232428] cpuid = 2; apic id = 02
[1232428] fault virtual address	= 0x2010
[1232428] fault code		= supervisor write data, page not present
[1232428] instruction pointer	= 0x20:0xffffffff83595228
[1232428] stack pointer	        = 0x0:0xfffffe0000504540
[1232428] frame pointer	        = 0x0:0xfffffe00005045a0
[1232428] code segment		= base rx0, limit 0xfffff, type 0x1b
[1232428] 			= DPL 0, pres 1, long 1, def32 0, gran 1
[1232428] processor eflags	= interrupt enabled, resume, IOPL = 0
[1232428] current process		= 0 (if_io_tqg_2)
[1232428] trap number		= 12
[1232428] panic: page fault
[1232428] cpuid = 2
[1232428] time = 1606389244
[1232428] KDB: stack backtrace:
[1232428] db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0000504200
[1232428] vpanic() at vpanic+0x17b/frame 0xfffffe0000504250
[1232428] panic() at panic+0x43/frame 0xfffffe00005042b0
[1232428] trap_fatal() at trap_fatal+0x391/frame 0xfffffe0000504310
[1232428] trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0000504360
[1232428] trap() at trap+0x286/frame 0xfffffe0000504470
[1232428] calltrap() at calltrap+0x8/frame 0xfffffe0000504470
[1232428] --- trap 0xc, rip = 0xffffffff83595228, rsp = 0xfffffe0000504540, rbp = 0xfffffe00005045a0 ---
[1232428] pfr_update_stats() at pfr_update_stats+0x1a8/frame 0xfffffe00005045a0
[1232428] pf_test() at pf_test+0xebe/frame 0xfffffe0000504740
[1232428] pf_check_in() at pf_check_in+0x1d/frame 0xfffffe0000504760
[1232428] pfil_run_hooks() at pfil_run_hooks+0x87/frame 0xfffffe00005047f0
[1232428] ip_input() at ip_input+0x40e/frame 0xfffffe00005048a0
[1232428] netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe00005048f0
[1232428] ether_demux() at ether_demux+0x138/frame 0xfffffe0000504920
[1232428] ether_nh_input() at ether_nh_input+0x33b/frame 0xfffffe0000504980
[1232428] netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe00005049d0
[1232428] ether_input() at ether_input+0x4b/frame 0xfffffe0000504a00
[1232428] iflib_rxeof() at iflib_rxeof+0xae6/frame 0xfffffe0000504ae0
[1232428] _task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe0000504b20
[1232428] gtaskqueue_run_locked() at gtaskqueue_run_locked+0x121/frame 0xfffffe0000504b80
[1232428] gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xb6/frame 0xfffffe0000504bb0
[1232428] fork_exit() at fork_exit+0x7e/frame 0xfffffe0000504bf0
[1232428] fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0000504bf0
[1232428] --- trap 0, rip = 0, rsp = 0, rbp = 0 ---



[10502] Fatal trap 12: page fault while in kernel mode
[10502] cpuid = 12; apic id = 0c
[10502] fault virtual address	= 0xc030
[10502] fault code		= supervisor write data, page not present
[10502] instruction pointer	= 0x20:0xffffffff83595228
[10502] stack pointer	        = 0x28:0xfffffe0000536330
[10502] frame pointer	        = 0x28:0xfffffe0000536390
[10502] code segment		= base rx0, limit 0xfffff, type 0x1b
[10502] 			= DPL 0, pres 1, long 1, def32 0, gran 1
[10502] processor eflags	= interrupt enabled, resume, IOPL = 0
[10502] current process		= 0 (if_io_tqg_12)
[10502] trap number		= 12
[10502] panic: page fault
[10502] cpuid = 12
[10502] time = 1606400775
[10502] KDB: stack backtrace:
[10502] db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0000535ff0
[10502] vpanic() at vpanic+0x17b/frame 0xfffffe0000536040
[10502] panic() at panic+0x43/frame 0xfffffe00005360a0
[10502] trap_fatal() at trap_fatal+0x391/frame 0xfffffe0000536100
[10502] trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0000536150
[10502] trap() at trap+0x286/frame 0xfffffe0000536260
[10502] calltrap() at calltrap+0x8/frame 0xfffffe0000536260
[10502] --- trap 0xc, rip = 0xffffffff83595228, rsp = 0xfffffe0000536330, rbp = 0xfffffe0000536390 ---
[10502] pfr_update_stats() at pfr_update_stats+0x1a8/frame 0xfffffe0000536390
[10502] pf_test() at pf_test+0xebe/frame 0xfffffe0000536530
[10502] pf_check_out() at pf_check_out+0x1d/frame 0xfffffe0000536550
[10502] pfil_run_hooks() at pfil_run_hooks+0x87/frame 0xfffffe00005365e0
[10502] ip_output() at ip_output+0xaf8/frame 0xfffffe0000536730
[10502] ip_forward() at ip_forward+0x32e/frame 0xfffffe00005367f0
[10502] ip_input() at ip_input+0x7c5/frame 0xfffffe00005368a0
[10502] netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe00005368f0
[10502] ether_demux() at ether_demux+0x138/frame 0xfffffe0000536920
[10502] ether_nh_input() at ether_nh_input+0x33b/frame 0xfffffe0000536980
[10502] netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe00005369d0
[10502] ether_input() at ether_input+0x4b/frame 0xfffffe0000536a00
[10502] iflib_rxeof() at iflib_rxeof+0xae6/frame 0xfffffe0000536ae0
[10502] _task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe0000536b20
[10502] gtaskqueue_run_locked() at gtaskqueue_run_locked+0x121/frame 0xfffffe0000536b80
[10502] gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xb6/frame 0xfffffe0000536bb0
[10502] fork_exit() at fork_exit+0x7e/frame 0xfffffe0000536bf0
[10502] fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0000536bf0
[10502] --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Comment 1 Kristof Provost freebsd_committer freebsd_triage 2020-11-26 20:35:16 UTC
Please include your pf.conf and the table file.
Comment 2 sigsys 2020-11-26 21:07:34 UTC
Created attachment 220012 [details]
pf.conf
Comment 3 sigsys 2020-11-26 21:07:51 UTC
Created attachment 220013 [details]
table
Comment 4 sigsys 2020-11-26 21:10:30 UTC
(In reply to Kristof Provost from comment #1)
I must admit, even though the ruleset isn't too big, there might be A LITTLE BIT of weirdness going on there with FIBs and firewalled VMs on separate bridges and routed taps.  But it was working fine before the big table.

Also I already had tables in it, but with just a few entries and they very rarely changed.  They weren't being reloaded automatically with the rest of the ruleset unlike that new <firehol-level1> one I tried to add.
Comment 5 Kristof Provost freebsd_committer freebsd_triage 2020-11-26 21:28:14 UTC
Do you have any pf_flags or anything else set in /etc/rc.conf?
Comment 6 sigsys 2020-11-26 21:36:13 UTC
(In reply to Kristof Provost from comment #5)

No pf_flags.  Only pf settings are:

pf_enable="YES"
pflog_enable="YES" 
pflog_flags="-s 0"
Comment 7 sigsys 2020-11-26 22:42:41 UTC
OOOPS.  Sorry, I made a mistake.  I kept editing my pf.conf while trying to figure this out and got confused as to what edits I had on exactly when the crashes happen.

It seems to be related to the "counters" keyword of tables.  I had it on at some point and it can cause crashes later on when you reload.

	table <firehol-level1> persist counters file "/etc/pf/firehol-level1-blocklist.tab"

If I've been reading this stack trace and the disassembly correctly, the crash must be happening right there:

https://github.com/freebsd/freebsd/blob/releng/12.2/sys/netpfil/pf/pf_table.c#L2112-L2113

   2c228: 65 48 83 01 01                addq    $1, %gs:(%rcx)
Comment 8 Mark Johnston freebsd_committer freebsd_triage 2020-11-27 14:56:53 UTC
This one is my fault I think, but I haven't been able to trigger it on head by reloading a table while toggling the "counters" keyword.
Comment 9 sigsys 2020-11-28 01:18:19 UTC
(In reply to Mark Johnston from comment #8)
Looks like I found an easy way to reproduce on -CURRENT.  It doesn't even need a big table loaded from a file.  Looks like it's related to toggling the counters support like you said.  I guess I somehow managed to keep messing with this flag while trying out rulesets.

Load this ruleset and enable pf:

	pass all
	table <tab> { self }
	pass in log to <tab>

Then switch to this ruleset:

	pass all
	table <tab> counters { self }
	pass in log to <tab>

Then send in some traffic that should create a new state.  Panics every time for me.  If the table already existed it might need to be destroyed first.

I think the problem is in pfr_commit_ktable().  When it transfers the addresses from the "shadow" inactive table to the active table, in some cases the existing entries will be reused and those won't have their counters if the active table wasn't created with counters support to begin with.  Maybe it could end up leaking counters too if you try to disabling them in the new table?  I think the table can end up with a mix of entries with and without counters.
Comment 10 Mark Johnston freebsd_committer freebsd_triage 2020-12-01 18:41:59 UTC
(In reply to sigsys from comment #9)
Thanks for the analysis, I believe you are right.

This patch fixes the bug for me: https://reviews.freebsd.org/D27440
Comment 11 sigsys 2020-12-01 21:31:11 UTC
(In reply to Mark Johnston from comment #10)
Works great here.  Switching between two rulesets with and without counters for the table in a loop while killing the table in a loop while running iperf works.

I didn't think of exchanging the counters like that.  It's a nice solution, it avoids having to allocate new ones in a function that must not fail and it gets rid of the existing counters when they're not needed anymore.
Comment 12 commit-hook freebsd_committer freebsd_triage 2020-12-02 16:02:07 UTC
A commit references this bug:

Author: markj
Date: Wed Dec  2 16:01:43 UTC 2020
New revision: 368276
URL: https://svnweb.freebsd.org/changeset/base/368276

Log:
  pf: Fix table entry counter toggling

  When updating a table, pf will keep existing table entry structures
  corresponding to addresses that are in both of the old and new tables.
  However, the update may also enable or disable per-entry counters which
  are allocated separately.  Thus when toggling PFR_TFLAG_COUNTERS, the
  entries may be missing counters or may have unused counters allocated.

  Fix the problem by modifying pfr_ina_commit() to transfer counters
  from or to entries in the shadow table.

  PR:		251414
  Reported by:	sigsys@gmail.com
  Reviewed by:	kp
  MFC after:	1 week
  Sponsored by:	The FreeBSD Foundation
  Differential Revision:	https://reviews.freebsd.org/D27440

Changes:
  head/sys/netpfil/pf/pf_table.c
Comment 13 commit-hook freebsd_committer freebsd_triage 2020-12-02 16:34:11 UTC
A commit references this bug:

Author: kp
Date: Wed Dec  2 16:33:23 UTC 2020
New revision: 368277
URL: https://svnweb.freebsd.org/changeset/base/368277

Log:
  pf tests: Test case for bug #251414

  Changing a table from not having counters to having counters (or vice versa)
  may trigger panics.

  PR:		251414
  MFC after:	1 week
  Differential Revision:	https://reviews.freebsd.org/D27441

Changes:
  head/tests/sys/netpfil/pf/table.sh
Comment 14 commit-hook freebsd_committer freebsd_triage 2020-12-09 00:57:04 UTC
A commit references this bug:

Author: markj
Date: Wed Dec  9 00:56:15 UTC 2020
New revision: 368470
URL: https://svnweb.freebsd.org/changeset/base/368470

Log:
  MFC r368276:
  pf: Fix table entry counter toggling

  PR:	251414

Changes:
_U  stable/12/
  stable/12/sys/netpfil/pf/pf_table.c
Comment 15 Mark Johnston freebsd_committer freebsd_triage 2020-12-09 00:58:58 UTC
Thanks for the report and analysis.
Comment 16 commit-hook freebsd_committer freebsd_triage 2020-12-09 17:18:20 UTC
A commit references this bug:

Author: kp
Date: Wed Dec  9 17:17:45 UTC 2020
New revision: 368488
URL: https://svnweb.freebsd.org/changeset/base/368488

Log:
  MFC r368277:

  pf tests: Test case for bug #251414

  Changing a table from not having counters to having counters (or vice versa)
  may trigger panics.

  PR:		251414

Changes:
_U  stable/12/
  stable/12/tests/sys/netpfil/pf/table.sh