We run pf-based firewalls leveraging CARP and pfsync, and maintain load-balanced services on them using relayd. Yesterday, relayd failed on the backup firewall and would not restart via service commands. However, it would run with relayd -d (and relayd -dv). State counts and memory usage on the firewalls were nominal. I'm not sure whether the fault lies in relayd, in the kernel's /dev/pf, or somewhere else.

Nov 11 15:01:58 backup-firewall relayd[80498] startup
Nov 11 15:02:13 backup-firewall relayd[80499] fatal: sync_table: cannot set address list: Cannot allocate memory
Nov 11 15:02:13 backup-firewall relayd[80500] hce exiting, pid 80500
Nov 11 15:02:13 backup-firewall relayd[80501] relay exiting, pid 80501
Nov 11 15:02:13 backup-firewall relayd[80503] relay exiting, pid 80503
Nov 11 15:02:13 backup-firewall relayd[80504] relay exiting, pid 80504
Nov 11 15:02:13 backup-firewall relayd[80505] ca exiting, pid 80505
Nov 11 15:02:13 backup-firewall relayd[80502] ca exiting, pid 80502
Nov 11 15:02:13 backup-firewall relayd[80506] ca exiting, pid 80506
Nov 11 15:02:13 backup-firewall relayd[80498] lost child: pfe exited abnormally
Nov 11 15:02:13 backup-firewall relayd[80498] parent terminating, pid 80498
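For anyone triaging a similar log, the relevant lines can be pulled out mechanically. The following is only a rough sketch in Python: the function name `triage`, the regular expression, and the shortened `SAMPLE` are illustrative assumptions, and the pattern is fitted to the log format pasted in this report (where the colon after `relayd[pid]` is absent), not to syslog in general. It separates the `fatal:` line from the parent's `lost child:` line, which here points at pfe (relayd's pf engine process) and so suggests the failure occurred while updating a pf table, rather than in the relay or health-check processes.

```python
import re

# Fitted to lines such as:
#   "Nov 11 15:02:13 backup-firewall relayd[80499] fatal: sync_table: ..."
# The colon after "relayd[pid]" is optional because the pasted log omits it.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+relayd\[(?P<pid>\d+)\]:?\s+(?P<msg>.*)$"
)


def triage(lines):
    """Return (fatal, lost) where fatal is a list of (pid, reason) tuples
    and lost is a list of "lost child" descriptions from the parent."""
    fatal = []
    lost = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a relayd line in the expected shape
        msg = m.group("msg")
        if msg.startswith("fatal:"):
            fatal.append((int(m.group("pid")), msg[len("fatal:"):].strip()))
        elif msg.startswith("lost child:"):
            lost.append(msg[len("lost child:"):].strip())
    return fatal, lost


# A shortened copy of the log from this report, used as sample input.
SAMPLE = """\
Nov 11 15:01:58 backup-firewall relayd[80498] startup
Nov 11 15:02:13 backup-firewall relayd[80499] fatal: sync_table: cannot set address list: Cannot allocate memory
Nov 11 15:02:13 backup-firewall relayd[80500] hce exiting, pid 80500
Nov 11 15:02:13 backup-firewall relayd[80498] lost child: pfe exited abnormally
Nov 11 15:02:13 backup-firewall relayd[80498] parent terminating, pid 80498
""".splitlines()

if __name__ == "__main__":
    fatal, lost = triage(SAMPLE)
    for pid, reason in fatal:
        print(f"pid {pid} fatal: {reason}")
    for child in lost:
        print(f"parent lost child: {child}")
```

This only narrows down which relayd process failed; it does not say whether the allocation failure originates in relayd or in the kernel.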
Hi, are you able to run something like memtest on that firewall, something that would stress-test the RAM? At first glance it looks like it might be a hardware problem ("Cannot allocate memory" despite low memory usage).
(In reply to tech-lists from comment #1) I think memory issues would have come up and been a lot more disruptive during its upgrade to FreeBSD 13.0, or within the period it was being proofed out as the primary firewall. It really feels like my issue and https://lists.freebsd.org/archives/freebsd-pf/2021-October/000136.html are looking at the same problem from different aspects.
(In reply to jjasen from comment #2) I also use pf-badhost but have seen no issues. Among other machines, it's running on a Raspberry Pi 4 (8GB) on stable/13, which also has net.pf.request_maxcount=400000 set as per https://geoghegan.ca/pfbadhost.html#instructions. In /var/log/messages, there are lines like:

Nov 11 00:00:20 REDACTED unbound-adblock[30209]: Changes (+/-): +7 Domain total : 128951
Nov 12 00:00:11 REDACTED pf-badhost[43205]: IPv4 addresses in table: 620442279

In a similar context (not with pf-badhost) on a different (amd64) machine (also 8GB) running 12.0 or 12.1, where the maxcount value was set in /boot/loader.conf, I ran up against the default limit (65536 I think) and had to manually set it to something like 254000. But I got an error message that was sufficiently descriptive to allow me to solve the problem. IIRC it actually said that maxcount needed to be increased. Unfortunately, the error your system is reporting isn't as descriptive.
Created attachment 229578: relayctl show redirect output
Created attachment 229579: relayd.conf file in use on the two PF firewalls in question
Created attachment 229580: vmstat -m output soon after a relayd crash
Created attachment 229582: vmstat -s output from a broken pf firewall
As an update, this now seems to be a general problem with pf. I have two systems where relayd will not reload after start, and attempting to reload rules via pfctl results in:

etc/pf/tables.conf:938: cannot define table clients: Cannot allocate memory
... <ad nauseam>
pfctl: Syntax error in config file: pf rules not loaded

Some have suggested that low memory is causing the issue. These systems each have 32GB of RAM and dual multi-core Intel E5-2697 v2 processors.
(In reply to jjasen from comment #0) Hi, I'm seeing the problem in a different context (arm64 on recent -current); please see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=260406
(In reply to jjasen from comment #8) Hi, I'm now seeing this problem on amd64 like you, on stable/13-n249464-d0199f27c06 (built 1st March 2022).
(In reply to jjasen from comment #8) I've applied the patch from https://bugs.freebsd.org/bugzilla/attachment.cgi?id=230375&action=diff, recompiled, and rebooted. So far the problem hasn't reappeared.
I believe this was addressed in a 13.0 errata release, but I forget which one.