Bug 275470 - Kernel Panic in IPFW when adding entries to table
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 14.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-ipfw (Nobody)
Keywords: crash
 
Reported: 2023-12-01 12:46 UTC by Thierry Dussuet
Modified: 2023-12-05 16:28 UTC


Description Thierry Dussuet 2023-12-01 12:46:35 UTC
Hi everyone, when I add entries to an ipfw table through a cron job, a kernel panic is triggered after a few days (8-9 days):

Fatal trap 12: page fault while in kernel mode
cpuid = 11; apic id = 0b
fault virtual address   = 0x2c
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff81f5daf2
stack pointer           = 0x28:0xfffffe016860c800
frame pointer           = 0x28:0xfffffe016860c900
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 78144 (ipfw)
rdi: fffff800050f8000 rsi: 0000000000000000 rdx: 0000000000001000
rcx: 0000000000000040  r8: fffffe01c1a59000  r9: 0000000000000b40
rax: 000000000000f4fd rbx: fffff800050f8000 rbp: fffffe016860c900
r10: 4000000000000000 r11: fffffe0167fcd540 r12: fffff802f021d700
r13: 0000000000000002 r14: fffffe016860c958 r15: fffffe016860c888
trap number             = 12
panic: page fault
cpuid = 11
time = 1701385226
KDB: stack backtrace:
#0 0xffffffff80b9002d at kdb_backtrace+0x5d
#1 0xffffffff80b43132 at vpanic+0x132
#2 0xffffffff80b42ff3 at panic+0x43
#3 0xffffffff8100c85c at trap_fatal+0x40c
#4 0xffffffff8100c8af at trap_pfault+0x4f
#5 0xffffffff80fe3828 at calltrap+0x8
#6 0xffffffff81f530bb at add_table_entry+0x54b
#7 0xffffffff81f572e0 at manage_table_ent_v1+0x1c0
#8 0xffffffff81f4d069 at ipfw_ctl3+0x689
#9 0xffffffff80beadc3 at sogetopt+0xd3
#10 0xffffffff80bef79f at kern_getsockopt+0xaf
#11 0xffffffff80bef6c2 at sys_getsockopt+0x52
#12 0xffffffff8100d119 at amd64_syscall+0x109
#13 0xffffffff80fe413b at fast_syscall_common+0xf8
Uptime: 8d23h59m54s
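
For reference, the "time" field in the panic message is a Unix timestamp. A small sketch for converting it to UTC, assuming either BSD or GNU date is available (the helper name epoch_to_utc is only illustrative, not part of any tool):

```shell
epoch_to_utc() {
    # BSD date takes the epoch via -r; GNU date takes it via -d @N.
    date -u -r "$1" '+%Y-%m-%d %H:%M:%S' 2>/dev/null ||
        date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

epoch_to_utc 1701385226
```

For the value above this prints 2023-11-30 23:00:26.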

# uname -v
FreeBSD 14.0-RELEASE #0 releng/14.0-n265380-f9716eee8ab4: Fri Nov 10 05:57:23 UTC 2023     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC

The cron job does:
rsync -aqz rsync-mirrors.uceprotect.net::RBLDNSD-ALL /tmp/dnsbl/

awk '/^[0-9]/ && !/127.0.0/ {print $1}' /tmp/dnsbl/dnsbl-1.uceprotect.net | xargs -n10 -P1 ipfw -q table 53 add

# wc -l /tmp/dnsbl/dnsbl-1.uceprotect.net
   99568 /tmp/dnsbl/dnsbl-1.uceprotect.net

The ipfw tables in use:
00001 deny ip from table(1) to me
00002 deny ip from table(22) to me
00003 deny ip from table(42) to me
00004 deny ip from table(53) to me
(and then other rules including nat)

# ipfw table 53 detail
--- table(53), set(0) ---
 kindex: 4, type: addr
 references: 1, valtype: legacy
 algorithm: addr:radix
 items: 49760, size: 5971496
 IPv4 algorithm radix info
  items: 49760 itemsize: 120
 IPv6 algorithm radix info
  items: 0 itemsize: 128

The -n10 and -P1 arguments for xargs were an attempt to reduce parallel calls to ipfw. This seems to have delayed the panics by a few days, but I cannot say for certain.
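
To illustrate what -n10 -P1 does, here is a minimal sketch that counts how many times xargs starts the command for a given number of input items. It uses echo as a harmless stand-in for ipfw, and the function name count_invocations is made up for this example:

```shell
# Count how many times xargs runs the command when each run receives
# at most 10 arguments and only one process runs at a time.
count_invocations() {
    seq 1 "$1" | xargs -n10 -P1 echo | wc -l | tr -d ' '
}

count_invocations 25
```

With 25 items this reports 3 runs (10 + 10 + 5 arguments), so a list with tens of thousands of addresses still means thousands of separate ipfw processes per cron run, executed one after another.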

Is there any missing information, or action, which could help track down what is happening? I am also willing to switch to -CURRENT and try any patches if that might help.

(I found bug #272073, which has a workaround of setting sysctl kern.ipc.mb_use_ext_pgs=0, for what seems like a similar kernel panic, although it is reached through a different path inside ipfw.)
Comment 1 Eugene Perevyazko 2023-12-01 14:37:12 UTC
Setting aside the kernel panic itself, I'd like to propose a workaround for your script:

# Write all "table 53 add <addr>" commands to one file and load it with a single ipfw call.
TMPFILE=$(mktemp -t tbl53) || exit 1
awk '/^[0-9]/ && !/127.0.0/ {print "table 53 add "$1}' /tmp/dnsbl/dnsbl-1.uceprotect.net > "$TMPFILE"
ipfw -q "$TMPFILE"
rm "$TMPFILE"

It should also be much faster and lighter on CPU.
For example, it takes less than a second of wall time on an ancient Core 2:
# ipfw table 53 flush
# ipfw table 53 list | wc
       0       0       0
# time ipfw -q /tmp/tbl53.0kfeoXCu 
0.179u 0.242s 0:00.42 97.6%	158+184k 0+0io 0pf+0w
# ipfw table 53 list | wc
   65698  131396 1278284
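
A minimal sketch of how the batch-file approach above could be wrapped for use from cron, assuming a POSIX sh on FreeBSD. The function names gen_table_cmds and load_table are hypothetical, and the dots in 127.0.0 are escaped here, which is slightly stricter than the original one-liner:

```shell
#!/bin/sh
# Hypothetical helper: read a blocklist on stdin and print one
# "table N add ADDR" command per line that starts with a digit,
# skipping entries that contain 127.0.0.
gen_table_cmds() {
    awk -v tbl="$1" '/^[0-9]/ && !/127\.0\.0/ { print "table " tbl " add " $1 }'
}

# Hypothetical wrapper: build the batch file, hand it to a single ipfw
# process, and remove the temporary file whether or not ipfw succeeded.
load_table() {
    tbl=$1
    list=$2
    tmpfile=$(mktemp -t "tbl${tbl}") || return 1
    gen_table_cmds "$tbl" < "$list" > "$tmpfile"
    ipfw -q "$tmpfile"
    rc=$?
    rm -f "$tmpfile"
    return "$rc"
}
```

It could then be called as: load_table 53 /tmp/dnsbl/dnsbl-1.uceprotect.net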
Comment 2 Thierry Dussuet 2023-12-01 19:06:21 UTC
(In reply to Eugene Perevyazko from comment #1)

Thank you, that's very kind, and a great idea!