Ever since I updated this server from 10.3-RELEASE to 11.1-RELEASE a few weeks ago, it sometimes just stops accepting connections (existing connections are fine). The kernel complains about too many firewall states:

[zone: pf states] PF states limit reached

A quick look at those states with pfctl reveals tens of thousands of old and dead connections that should be long gone - for example, FIN_WAIT_2 states with an age of three hours. The pfctl output says "expires in 00:00:00" for all of these connections, so pf obviously agrees that they're dead but doesn't delete them for some reason.

When I first diagnosed this problem, adding "set timeout interval 1" to the pf configuration immediately cleared out the old states and the server was up and running again. However, this did not permanently fix the issue. The server keeps going down regularly and I have to manually flush the pf states to get it back online.
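For anyone triaging the same symptom, a few stock pfctl invocations are enough to confirm it; these are standard pfctl options, though the exact output wording can vary slightly between releases:

# how close the state table is to the configured limit
pfctl -si | grep "current entries"
# count states that pf itself already considers expired
pfctl -vss | grep -c "expires in 00:00:00"
# last resort: flush the whole state table (this also drops live connections)
pfctl -F states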
I don't think I've seen that before. Do you have a non-default kernel config? (I'm especially interested in knowing if you're running VIMAGE, but any non-default settings are interesting to know about.) Can you check if the pf purge thread is running? (It shows up as '[pf purge]' in ps output.) Please also give this dtrace script a quick try:

#!/usr/sbin/dtrace -s

fbt:pf:pf_purge_expired_states:entry { }
fbt:pf:pf_purge_expired_states:return { }
fbt:pf:pf_purge_unlinked_rules:entry { }
fbt:pf:pf_purge_unlinked_rules:return { }
fbt:pf:pf_purge_expired_fragments:entry { }
fbt:pf:pf_purge_expired_src_nodes:entry { }
fbt:pf:pf_purge_expired_src_nodes:return { }
fbt:pf:pf_purge_thread:entry { }

It should show us if the purge code is running or not.
The purge thread is there:

root    16  0,0  0,0     0    16  -  DL   10:57    0:00,04 - [pf purge]

The dtrace script fails:

dtrace: failed to compile script ./pf.dtrace: line 3: probe description fbt:pf:pf_purge_expired_states:entry does not match any probes

The kernel configuration is mostly just GENERIC with plenty of unneeded drivers removed, but VIMAGE is enabled, as is "device pf".
(In reply to noah.bergbauer from comment #2)
The function may have wound up being inlined in your kernel config, so we can't instrument it. Just remove whatever functions dtrace can't find; we'll get something out of what's left over. I'm building an equivalent config to try to reproduce, but that will take a bit of time.
I just realized that because pf is compiled into the kernel, the dtrace probes need to be fbt:kernel:* instead of fbt:pf:*. With that, it starts up:

dtrace: script './pf.dtrace' matched 8 probes

and then just sits there with no output. As states were filling up again, I removed the "set timeout interval 1" from the config and re-added it. This again cleared out the dead states and dtrace started printing things for a few seconds, but now it's silent again.
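As an aside for anyone reproducing this: when pf is compiled into the kernel, you can first ask dtrace which purge-related FBT probes actually exist and then trace just those. These are ordinary dtrace flags, nothing specific to this bug:

# list matching probes without enabling them
dtrace -ln 'fbt:kernel:pf_purge_*:entry'
# trace only the probes that were found to exist
dtrace -n 'fbt:kernel:pf_purge_expired_states:entry' -n 'fbt:kernel:pf_purge_thread:entry'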
Okay, that's an interesting clue. Let's see if we can work out where it's getting stuck. Can you do 'sudo procstat -kk <pid of pf purge>'?
# procstat -kk 16
16 100102 pf purge - mi_switch+0xe5 sleepq_timedwait+0x42 _sleep+0x235 pf_purge_thread+0x8e fork_exit+0x85 fork_trampoline+0xe
Hmm, that looks entirely normal. Your purge thread is sleeping in the rw_sleep() call. It should be waking up every 10 seconds or so. It's either not waking up, or it's not actually doing anything when it wakes up. I have no idea how either one could happen.
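A low-effort way to catch the thread in the act is to log its kernel stack periodically and see whether the timestamps keep advancing once states stop expiring. A minimal sketch in plain sh, assuming the thread keeps PID 16 as in the procstat output above:

#!/bin/sh
# append a timestamped kernel stack of the pf purge thread once a minute
while sleep 60; do
    date
    procstat -kk 16
done >> /var/log/pf-purge-stack.log 2>&1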
The workaround I'm using now is:

echo "set timeout interval 2" | pfctl -mf - && echo "set timeout interval 1" | pfctl -mf -

run every hour from crontab. It causes the purge thread to wake up (https://github.com/freebsd/freebsd/blob/e0bfaddbc2f6dac2c494ca1fa0566cd4e2ee0659/sys/netpfil/pf/pf_ioctl.c#L1924) and from there it works fine for a while (very random - last time it was just a few seconds, but now it's 20 minutes and still going). So it apparently just isn't waking up on its own for some reason.
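For reference, that hourly workaround as a root crontab entry would look roughly like this (the /sbin/pfctl path is the FreeBSD default; adjust the schedule to taste):

# crontab -e (as root): nudge the purge interval once an hour
0 * * * * echo "set timeout interval 2" | /sbin/pfctl -mf - && echo "set timeout interval 1" | /sbin/pfctl -mf -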
That's quite odd. It's almost as if rw_sleep() isn't doing what it's supposed to do. That seems unlikely, but not impossible. Is the output of 'sysctl kern.hz' sane? If you can experiment on the box it might be worth testing to see if r320848 (PR 219251) helps. I can't quite convince myself that it would produce these symptoms, but it can't really hurt to try.
# sysctl kern.hz
kern.hz: 1000

I can't really experiment on the machine, but from what I can see, r320848 shouldn't change anything anyway because VNET isn't in use right now.
(In reply to noah.bergbauer from comment #10)
Are there any limits in your ruleset? And what does "pfctl -vsi" show? This sounds familiar... just like bug #217997.
set limit { states 100000, src-nodes 10000 }

One of my first attempts to fix this was increasing both limits 10x - didn't help though.

# pfctl -vsi
No ALTQ support in kernel
ALTQ related functions disabled
Status: Enabled for 1 days 14:44:53           Debug: Urgent

Hostid:   0x4b1e78c2
Checksum: 0x67f2a9cbd7b0d65ce52864ecfc156ebb

State Table                          Total             Rate
  current entries                     3839
  searches                       360179452         2582.1/s
  inserts                           594949            4.3/s
  removals                          591110            4.2/s
Source Tracking Table
  current entries                        0
  searches                               0            0.0/s
  inserts                                0            0.0/s
  removals                               0            0.0/s
Counters
  match                             689782            4.9/s
  bad-offset                             0            0.0/s
  fragment                              16            0.0/s
  short                                  0            0.0/s
  normalize                              0            0.0/s
  memory                                 0            0.0/s
  bad-timestamp                          0            0.0/s
  congestion                             0            0.0/s
  ip-option                              0            0.0/s
  proto-cksum                          450            0.0/s
  state-mismatch                       942            0.0/s
  state-insert                           0            0.0/s
  state-limit                            0            0.0/s
  src-limit                              0            0.0/s
  synproxy                               0            0.0/s
  map-failed                             0            0.0/s
Limit Counters
  max states per rule                    0            0.0/s
  max-src-states                         0            0.0/s
  max-src-nodes                          0            0.0/s
  max-src-conn                           0            0.0/s
  max-src-conn-rate                      0            0.0/s
  overload table insertion               0            0.0/s
  overload flush states                  0            0.0/s
(In reply to noah.bergbauer from comment #12)
> Status: Enabled for 1 days 14:44:53
Have you had any issues during this period? And do you know which rule produces expired states?
(In reply to Max from comment #13) Maybe, maybe not. The point of my workaround is to get a mostly functioning machine. However, the reboot right before this period was necessary because for some reason not even that workaround works consistently.
(In reply to noah.bergbauer from comment #14)
Given the nature of your workaround and what we've seen from DTrace, I don't think bug #217997 is the problem. I'm also pretty sure that fix is included in 11.1.
(In reply to noah.bergbauer from comment #14)
I'll try to reproduce the problem, but I need a starting point. Rules, state entries for the dead connections... anything?
(In reply to Kristof Provost from comment #15) You are right. It is not the problem. But it looks quite similar.
I'm encountering the same problem with CURRENT on arm64 (Pine64+ 2GB):

FreeBSD norquay.restart.bel 12.0-CURRENT FreeBSD 12.0-CURRENT #0 r320599M: Sat Jul 8 18:24:19 CEST 2017 root@norquay.restart.bel:/usr/obj/usr/src/sys/NORQUAY arm64
(In reply to hlh from comment #18) Can you also run the dtrace script and 'sudo procstat -kk <pid of pf purge>'? Does the workaround also work for you? Can you post your ruleset?
The same workaround works for me:

echo "set timeout interval 5" | pfctl -mf -
echo "set timeout interval 10" | pfctl -mf -

but to be sure I also do:

nohup service pf restart

For more than a week (without any config change) the problem didn't occur. Next time it appears I will run the dtrace script and procstat.
Here is my pf.conf: # $FreeBSD: src/etc/pf.conf,v 1.3 2006/01/27 17:16:20 mlaier Exp $ # $OpenBSD: pf.conf,v 1.21 2003/09/02 20:38:44 david Exp $ # # See pf.conf(5) and /usr/share/examples/pf for syntax and examples. # Required order: options, normalization, queueing, translation, filtering. # Macros and tables may be defined and used anywhere. # Note that translation rules are first match while filter rules are last match. # Macros: define common values, so they can be referenced and changed easily. #ext_if="ext0" # replace with actual external interface name i.e., dc0 #int_if="int0" # replace with actual internal interface name i.e., dc1 #internal_net="10.1.1.1/8" #external_addr="192.168.1.1" #--- RestartSoft --- int_if="awg0" int_net="192.168.24.0/24, 2001:41d0:8:bdbe:1::/80" vdsl_if="ng0" tunnel_if="gif0" virtualbox_net="192.168.22.0/24, 2001:41d0:8:bdbe:2::/80" tignes="5.135.182.190" ftp_passive="30000:31000" smtp1="24" http1="8080" squid="3128" tcp_services="{" "ftp" $ftp_passive $smtp1 $squid "nntp" "http" $http1 "https" "}" bittorrent="6881:6889" donkey_tcp="{ 4662 6346 6347 6667 6881 6882 7254 21776 }" donkey_udp="{ 4666 6346 6347 7254 21780 }" meribel="192.168.24.8" platon="192.168.24.192" emule_tcp="{ 4242 4661 4662 4881 }" emule_udp="{ 4246 4665 4672 }" # Tables: similar to macros, but more flexible for many addresses. #table <foo> { 10.0.0.0/8, !10.1.0.0/16, 192.168.0.0/24, 192.168.1.18 } #--- RestartSoft --- table <rfc1918> const { 127.0.0.0/8, 192.168.0.0/16, 172.16.0.0/12, 10.0.0.0/8 } table <internal_nets> const { $int_net, $virtualbox_net } table <bad_hosts> persist # Options: tune the behavior of pf, default values are given. #set timeout { interval 10, frag 30 } #set timeout { tcp.first 120, tcp.opening 30, tcp.established 86400 } #set timeout { tcp.closing 900, tcp.finwait 45, tcp.closed 90 } #set timeout { udp.first 60, udp.single 30, udp.multiple 60 } #set timeout { icmp.first 20, icmp.error 10 } #set timeout { other.first 60, other.single 30, other.multiple 60 } #set timeout { adaptive.start 0, adaptive.end 0 } #--- RestartSoft --- set limit { states 30000, src-nodes 20000, frags 20000 } #set loginterface none #set optimization normal #set block-policy drop #set require-order yes #set fingerprints "/etc/pf.os" #--- RestartSoft --- set block-policy drop set debug urgent set loginterface $vdsl_if set state-policy if-bound # Normalization: reassemble fragments and resolve or reduce traffic ambiguities. #scrub in all #--- RestartSoft --- scrub in all # Queueing: rule-based bandwidth control. #altq on $ext_if bandwidth 2Mb cbq queue { dflt, developers, marketing } #queue dflt bandwidth 5% cbq(default) #queue developers bandwidth 80% #queue marketing bandwidth 15% #--- RestartSoft --- #- altq on $vdsl_if bandwidth 10Mb cbq queue { dflt, p2p_upload } #- queue dflt bandwidth 80% cbq(default) #- queue p2p_upload bandwidth 20% # Translation: specify how addresses are to be mapped or redirected. # nat: packets going out through $ext_if with source address $internal_net will # get translated as coming from the address of $ext_if, a state is created for # such packets, and incoming packets will be redirected to the internal address. 
#nat on $ext_if from $internal_net to any -> ($ext_if) #--- RestartSoft --- # Translate all internal networks # Special case for ekiga (voip) on meribel nat on $vdsl_if proto udp from $meribel to any -> ($vdsl_if) static-port nat on $vdsl_if inet from <internal_nets> to any -> ($vdsl_if) # rdr: packets coming in on $ext_if with destination $external_addr:1234 will # be redirected to 10.1.1.1:5678. A state is created for such packets, and # outgoing packets will be translated as coming from the external address. #rdr on $ext_if proto tcp from any to $external_addr/32 port 1234 -> 10.1.1.1 port 5678 # rdr outgoing FTP requests to the ftp-proxy #rdr on $int_if proto tcp from any to any port ftp -> 127.0.0.1 port 8021 # spamd-setup puts addresses to be redirected into table <spamd>. #table <spamd> persist #no rdr on { lo0, lo1 } from any to any #rdr inet proto tcp from <spamd> to any port smtp -> 127.0.0.1 port 8025 #--- RestartSoft --- # Ekiga is running on meribel (192.168.24.8) rdr on $vdsl_if proto udp from any to ($vdsl_if) port 5000:5100 -> $meribel rdr on $vdsl_if proto tcp from any to ($vdsl_if) port 1720 -> $meribel # Filtering: the implicit first two rules are #pass in all #pass out all # block all incoming packets but allow ssh, pass all outgoing tcp and udp # connections and keep state, logging blocked packets. #block in log all #pass in on $ext_if proto tcp from any to $ext_if port 22 keep state #pass out on $ext_if proto { tcp, udp } all keep state #--- RestartSoft --- # Setup a default deny policy block in log all block out log all # pass incoming packets destined to the addresses given in table <foo>. #pass in on $ext_if proto { tcp, udp } from any to <foo> port 80 keep state # pass incoming ports for ftp-proxy #pass in on $ext_if inet proto tcp from any to $ext_if port > 49151 keep state # Alternate rule to pass incoming ports for ftp-proxy # NOTE: Please see pf.conf(5) BUGS section before using user/group rules. 
#pass in on $ext_if inet proto tcp from any to $ext_if user proxy keep state #--- RestartSoft --- # Skip filtering on loopback interfaces set skip on lo0 # Block Spoofed Packets # antispool don't work for bridge block in log quick from ($int_if) to any # Block RFC1918 addresses on VDSL block log quick on $vdsl_if from <rfc1918> to any # Block bad hosts trying ssh or smtp block log quick on $vdsl_if from <bad_hosts> to any # Block networks known to use Windows Metafile Vulnerability --- # see http://isc.sans.org/diary.php?storyid=997 block log quick on $vdsl_if from {69.50.160.0/19, 85.255.112.0/20 } to any # Allow traffic between physical internal networks (bridge0) #- pass in quick on $int_if0 all no state #- pass out quick on $int_if0 all no state #- pass in quick on $int_if1 all no state #- pass out quick on $int_if1 all no state # Allow traffic with physical internal networks pass quick on $int_if from ($int_if:network) to ($int_if:network) keep state pass in quick on $int_if from ($int_if:network) to ($vdsl_if) keep state pass quick on $int_if proto ipencap # Allow traffic with virtualbox internal network pass in quick on $int_if from {$virtualbox_net} to ($int_if) keep state pass in quick on $int_if from {$virtualbox_net} to ($vdsl_if) keep state pass out quick on $int_if from ($int_if) to {$virtualbox_net} keep state # Allow physical internal networks to go to INTERNET - caveat quick keyword #- pass in quick on $int_if proto tcp from $platon to any port $emule_tcp keep state queue p2p_upload #- pass in quick on $int_if proto udp from $platon to any port $emule_udp keep state queue p2p_upload pass in quick on $int_if from ($int_if:network) to any keep state # Allow virtualbox internal network to go to INTERNET pass in quick on $int_if from {$virtualbox_net} to any keep state # Allow all connections that I initiate to INTERNET pass out quick on $vdsl_if proto { tcp, udp, icmp } all keep state # Allow DNS and NTP (keep state take care of this) #- pass in quick on $vdsl_if proto udp from any port domain to ($vdsl_if) #- pass in quick on $vdsl_if proto udp from any port ntp to ($vdsl_if) port ntp # Allow resolvers on INTERNET to use our DNS #- pass in quick on $vdsl_if proto udp from any to ($vdsl_if) port domain # Allow IPv4 PING from everywhere pass quick inet proto icmp all icmp-type echoreq keep state # Allow NEEDFRAG from everywhere (see /usr/include/netinet/ip_icmp.h) pass in quick inet proto icmp all icmp-type unreach code needfrag keep state # Allow icmp6 from everywhere pass in quick inet6 proto icmp6 all # Reserved for test of mpd-4.3 with synproxy pass in quick on $vdsl_if proto tcp from any to ($vdsl_if) port 8000 flags S/SA synproxy state # Allow everyone on INTERNET to connect to the following services #--- NOTE --- synproxy can't work on $vdsl_if managed by mpd4.2.2 / mpd4.3 pass in quick on $vdsl_if proto tcp from any to ($vdsl_if) port ssh flags S/SA keep state (source-track rule, max-src-conn-rate 1/30, overload <bad_hosts> flush global) pass in quick on $vdsl_if proto tcp from any to ($vdsl_if) port smtp flags S/SA keep state (source-track rule, max-src-conn-rate 15/10, overload <bad_hosts> flush global) pass in quick on $vdsl_if proto tcp from any to ($vdsl_if) port $tcp_services flags S/SA keep state # Services in IPv6 (no IPv4 routing in tunnel) pass in quick on $tunnel_if proto tcp from any to any port $tcp_services flags S/SA keep state # Allow BitTorrent traffic with INTERNET pass in quick on $vdsl_if proto tcp from any to ($vdsl_if) port $bittorrent flags S/SA 
keep state # Allow Donkey traffic with INTERNET pass in quick on $vdsl_if proto tcp from any to ($vdsl_if) port $donkey_tcp flags S/SA keep state pass in quick on $vdsl_if proto udp from any to ($vdsl_if) port $donkey_udp keep state # Allow ekiga traffic (on meribel - see rdr) pass in quick on $vdsl_if proto udp from any to any port 5000:5100 keep state pass in quick on $vdsl_if proto tcp from any to any port 1720 keep state pass out quick on $int_if proto udp from any to any port 5000:5100 keep state pass out quick on $int_if proto tcp from any to any port 1720 keep state # Allow IPsec pass in quick on $vdsl_if proto udp from any to ($vdsl_if) port isakmp keep state pass out quick on $vdsl_if proto udp from ($vdsl_if) to any port isakmp keep state pass in quick on $vdsl_if proto udp from any to ($vdsl_if) port 4500 keep state pass in quick on $vdsl_if proto ah from any to ($vdsl_if) pass out quick on $vdsl_if proto ah from ($vdsl_if) to any pass in quick on $vdsl_if proto esp from any to ($vdsl_if) pass out quick on $vdsl_if proto esp from ($vdsl_if) to any # Allow IPv6 with tignes - tunnelling without encryption pass quick on $vdsl_if inet proto ipv6 pass quick on $tunnel_if inet6 # Allow IPv4 with tignes - tunnelling without encryption pass in quick on $vdsl_if proto ipencap from $tignes to ($vdsl_if) pass out quick on $vdsl_if proto ipencap from ($vdsl_if) to $tignes pass quick on $tunnel_if inet # Allow IPV6 with internal networks pass quick on $int_if inet6 # assign packets to a queue. #pass out on $ext_if from 192.168.0.0/24 to any keep state queue developers #pass out on $ext_if from 192.168.1.0/24 to any keep state queue marketing
The problem crop up: [root@norquay ~]# pfctl -si Status: Enabled for 1 days 08:09:42 Debug: Urgent Interface Stats for ng0 IPv4 IPv6 Bytes In 3355637698 0 Bytes Out 236586554 0 Packets In Passed 2587532 0 Blocked 3290 0 Packets Out Passed 2395320 0 Blocked 109 0 State Table Total Rate current entries 31 searches 10992548 94.9/s inserts 77585 0.7/s removals 77052 0.7/s Counters match 86805 0.7/s bad-offset 0 0.0/s fragment 0 0.0/s short 0 0.0/s normalize 0 0.0/s memory 0 0.0/s bad-timestamp 0 0.0/s congestion 0 0.0/s ip-option 0 0.0/s proto-cksum 0 0.0/s state-mismatch 9 0.0/s state-insert 2 0.0/s state-limit 0 0.0/s src-limit 8 0.0/s synproxy 105 0.0/s map-failed 0 0.0/s [root@norquay ~]# pfctl -ss|wc -l 533 [root@norquay ~]# procstat -kk 7 PID TID COMM TDNAME KSTACK 7 100084 pf purge - mi_switch+0x118 sleepq_timedwait+0x40 _sleep+0x268 pf_purge_thread+0xec fork_exit+0x94 [root@norquay dtrace]# ./pf.dtrace dtrace: script './pf.dtrace' matched 4 probes dtrace: buffer size lowered to 2m after: [root@norquay ~]# echo "set timeout interval 5" | pfctl -mf - CPU ID FUNCTION:NAME 1 2257 pf_purge_expired_states:entry 1 2258 pf_purge_expired_states:return 1 2258 pf_purge_expired_states:return 1 2258 pf_purge_expired_states:return 1 2257 pf_purge_expired_states:entry 1 2258 pf_purge_expired_states:return 1 2258 pf_purge_expired_states:return 1 2257 pf_purge_expired_states:entry 1 2258 pf_purge_expired_states:return 1 2258 pf_purge_expired_states:return 1 2257 pf_purge_expired_states:entry 1 2258 pf_purge_expired_states:return 1 2258 pf_purge_expired_states:return 1 2257 pf_purge_expired_states:entry 1 2258 pf_purge_expired_states:return 1 2258 pf_purge_expired_states:return 1 2257 pf_purge_expired_states:entry 1 2258 pf_purge_expired_states:return 1 2258 pf_purge_expired_states:return 1 2257 pf_purge_expired_states:entry 1 2258 pf_purge_expired_states:return 1 2258 pf_purge_expired_states:return 1 2257 pf_purge_expired_states:entry 1 2258 pf_purge_expired_states:return 1 2258 pf_purge_expired_states:return 1 2257 pf_purge_expired_states:entry 1 2258 pf_purge_expired_states:return 1 2258 pf_purge_expired_states:return 1 2257 pf_purge_expired_states:entry 1 2258 pf_purge_expired_states:return 1 2258 pf_purge_expired_states:return 1 2257 pf_purge_expired_states:entry 1 2258 pf_purge_expired_states:return 1 2258 pf_purge_expired_states:return 1 2257 pf_purge_expired_states:entry 1 2258 pf_purge_expired_states:return 1 2258 pf_purge_expired_states:return 1 2257 pf_purge_expired_states:entry 1 2258 pf_purge_expired_states:return 1 2258 pf_purge_expired_states:return 3 2257 pf_purge_expired_states:entry 3 2258 pf_purge_expired_states:return 3 2258 pf_purge_expired_states:return 3 2257 pf_purge_expired_states:entry 3 2258 pf_purge_expired_states:return 3 2258 pf_purge_expired_states:return 3 2257 pf_purge_expired_states:entry 3 2258 pf_purge_expired_states:return 3 2258 pf_purge_expired_states:return 3 2257 pf_purge_expired_states:entry 3 2258 pf_purge_expired_states:return 3 2258 pf_purge_expired_states:return 3 2257 pf_purge_expired_states:entry 3 2258 pf_purge_expired_states:return 3 2258 pf_purge_expired_states:return 3 2258 pf_purge_expired_states:return 3 2257 pf_purge_expired_states:entry 3 2258 pf_purge_expired_states:return 3 2258 pf_purge_expired_states:return 1 2257 pf_purge_expired_states:entry 1 2258 pf_purge_expired_states:return 1 2258 pf_purge_expired_states:return 1 2257 pf_purge_expired_states:entry .... 
[root@norquay ~]# echo "set timeout interval 10" | pfctl -mf -

And then no new states are created!

[root@norquay ~]# nohup service pf restart

resumes the normal operation of pf.
(In reply to hlh from comment #22)
Did you take that '# pfctl -si' before attempting any of the workarounds? It shows you've got 31 states; I don't see how you'd hit the state limit that way.

It might be interesting to keep dtrace running while you run the workaround. Everything I see so far suggests that we're stuck in the rw_sleep() (which was changed to an sx_sleep() in CURRENT), but that makes no sense. I don't see how that could happen, or how the workaround could help there.

Can you confirm you're not running low on memory? That might conceivably also trigger the 'PF states limit reached' warning (although I'd expect to see many other warnings too), but I'm not sure how that would present as a frozen purge thread.
I ran the pfctl -si after I detected the problem (with pftop) and before attempting any workaround. Here is the typical top:

last pid: 38563;  load averages:  0.15,  0.28,  0.25    up 7+14:37:46  08:20:09
67 processes:  1 running, 64 sleeping, 2 zombie
CPU:  0.0% user,  0.0% nice,  0.2% system,  0.1% interrupt, 99.7% idle
Mem: 49M Active, 84M Inact, 76M Laundry, 1728M Wired, 27M Free
ARC: 571M Total, 63M MFU, 358M MRU, 4934K Anon, 6512K Header, 138M Other
     315M Compressed, 520M Uncompressed, 1.65:1 Ratio
Swap: 4096M Total, 285M Used, 3811M Free, 6% Inuse

I don't see the 'PF states limit reached'. I ran the dtrace when I detected the problem. It showed nothing and only started showing the trace after I ran the first workaround.
(In reply to hlh from comment #24) Wait? You don't see 'PF states limit reached'? What are you seeing, exactly, and how do you detect it?
The first time I detected this problem was when a computer was not allowed to connect to the internet. I checked the gateway (the Pine64 running CURRENT) and found 'PF states limit reached' in /var/log/messages. Then I ran pftop and saw that there was a huge number of states. Rebooting the gateway solved the problem. I dug further and found the workaround. I added

set limit { states 30000, src-nodes 20000, frags 20000 }

to /etc/pf.conf. Then I regularly checked with pftop. For more than one week, no problem. But I continued to check and it occurred again. I only have to check from time to time, because even when the problem arises, the limit of 30000 is large enough to allow new connections to be established for some time...
(In reply to hlh from comment #26) I don't understand why your 'pfctl -si' would then only show 31 states. Either way, it would be interesting to run the dtrace script constantly (saving the output), and see what's in that log when the problem occurs. I'd expect it to stop showing updates when you end up in this state, and restart when you trigger the workaround. The interesting bit would be the last line(s) before it blocks, and the first line(s) when it resumes.
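One way to keep the script running unattended and preserve its output, as suggested above - a plain sh sketch, with the log path being just an example:

# run the probe script in the background, appending to a log across logouts
nohup ./pf.dtrace >> /var/log/pf.dtrace.log 2>&1 &
# dtrace can also add wall-clock timestamps itself, e.g. printf("%Y", walltimestamp)
# in each probe body, which makes it easier to line the trace up with syslog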
I think that `pfctl -si` shows the counters as they were when the problem started. I'm running dtrace from now on.
>> I don't understand why your 'pfctl -si' would then only show 31 states.
> I think that the `pfctl -si` shows the counters just when the problem start.

It's the same for me. As far as I remember, this behavior is easily explained by the fact that these counters are updated regularly by the purge thread, so... yeah. Frozen purge thread means frozen counters.
(In reply to noah.bergbauer from comment #29) Having checked the code, yes, I believe you're correct about that.
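That observation suggests a cheap detector for a stuck purge thread: compare the (frozen) counter from 'pfctl -si' with a live count of the state table. A small sh sketch; the awk field position is an assumption and may need adjusting for other pfctl versions:

#!/bin/sh
# if the live state count keeps growing while 'current entries' stands still,
# the purge thread has most likely stopped running
live=$(pfctl -ss 2>/dev/null | wc -l)
counter=$(pfctl -si 2>/dev/null | awk '/current entries/ {print $3}')
echo "live states: ${live}, reported current entries: ${counter}"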
(In reply to Kristof Provost from comment #27) When the problem occurs, no more line in the dtrace log. Here are the last 20 lines: [root@norquay ~]# tail -n 20 /var/log/pf.dtrace.log-before 0 2257 pf_purge_expired_states:entry 0 2258 pf_purge_expired_states:return 0 2258 pf_purge_expired_states:return 0 2257 pf_purge_expired_states:entry 0 2258 pf_purge_expired_states:return 0 2258 pf_purge_expired_states:return 3 2257 pf_purge_expired_states:entry 3 2258 pf_purge_expired_states:return 3 2258 pf_purge_expired_states:return 3 2257 pf_purge_expired_states:entry 3 2258 pf_purge_expired_states:return 3 2258 pf_purge_expired_states:return 3 2257 pf_purge_expired_states:entry 3 2258 pf_purge_expired_states:return 3 2258 pf_purge_expired_states:return 1 2257 pf_purge_expired_states:entry 1 2258 pf_purge_expired_states:return 1 2258 pf_purge_expired_states:return then I run: [root@norquay ~]# echo "set timeout interval 5" | pfctl -mf - And the dtrace log resume; the first 20 lines are: [root@norquay ~]# head -n 20 /var/log/pf.dtrace.log CPU ID FUNCTION:NAME 1 2257 pf_purge_expired_states:entry 1 2258 pf_purge_expired_states:return 1 2258 pf_purge_expired_states:return 0 2257 pf_purge_expired_states:entry 0 2258 pf_purge_expired_states:return 0 2258 pf_purge_expired_states:return 0 2257 pf_purge_expired_states:entry 0 2258 pf_purge_expired_states:return 0 2258 pf_purge_expired_states:return 0 2257 pf_purge_expired_states:entry 0 2258 pf_purge_expired_states:return 0 2258 pf_purge_expired_states:return 0 2257 pf_purge_expired_states:entry 0 2258 pf_purge_expired_states:return 0 2258 pf_purge_expired_states:return 0 2257 pf_purge_expired_states:entry 0 2258 pf_purge_expired_states:return 0 2258 pf_purge_expired_states:return 0 2258 pf_purge_expired_states:return PS I have the 2 log files and we can go back more then 20 lines.
(In reply to hlh from comment #31)
It's a little odd that you're seeing double pf_purge_expired_states:return entries. Any chance you've got two such probes in your dtrace script?

Anyway, let's stick a couple of static probes in and see what's going on:

diff --git a/sys/netpfil/pf/pf.c b/sys/netpfil/pf/pf.c
index 8613a161f0a..f8244a6ef6e 100644
--- a/sys/netpfil/pf/pf.c
+++ b/sys/netpfil/pf/pf.c
@@ -55,6 +55,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/md5.h>
 #include <sys/random.h>
 #include <sys/refcount.h>
+#include <sys/sdt.h>
 #include <sys/socket.h>
 #include <sys/sysctl.h>
 #include <sys/taskqueue.h>
@@ -105,6 +106,14 @@ __FBSDID("$FreeBSD$");
 
 #define DPFPRINTF(n, x) if (V_pf_status.debug >= (n)) printf x
 
+/* DTrace static probes */
+SDT_PROVIDER_DEFINE(pf);
+
+SDT_PROBE_DEFINE(pf, purge, thread, wakeup);
+SDT_PROBE_DEFINE2(pf, purge, , expired_states,
+    "unsigned int",
+    "int");
+
 /*
  * Global variables
  */
@@ -1434,6 +1443,7 @@ pf_purge_thread(void *unused __unused)
 	sx_xlock(&pf_end_lock);
 	while (pf_end_threads == 0) {
 		sx_sleep(pf_purge_thread, &pf_end_lock, 0, "pftm", hz / 10);
+		SDT_PROBE0(pf, purge, thread, wakeup);
 
 		VNET_LIST_RLOCK();
 		VNET_FOREACH(vnet_iter) {
@@ -1680,6 +1690,8 @@ pf_purge_expired_states(u_int i, int maxcheck)
 
 	V_pf_status.states = uma_zone_get_cur(V_pf_state_z);
 
+	SDT_PROBE2(pf, purge, , expired_states, i, maxcheck);
+
 	/*
 	 * Go through hash and unlink states that expire now.
 	 */

You can trace those with:

#!/usr/sbin/dtrace -s

pf:purge:thread:wakeup { }
pf:purge::expired_states { printf("i %d maxentry %d %d", arg0, arg1, arg2); }

Hopefully we'll get a clue as to what's going on with this.
(In reply to Kristof Provost from comment #32)
Here is the dtrace script used:

#!/usr/sbin/dtrace -s

fbt:kernel:pf_purge_expired_states:entry { }
fbt:kernel:pf_purge_expired_states:return { }
fbt:kernel:pf_purge_expired_fragments:entry { }
fbt:kernel:pf_purge_thread:entry { }

I'm making the changes you asked for.
(In reply to Kristof Provost from comment #32)
I checked the troubling number of returns:

[root@norquay log]# grep entry pf.dtrace1.log-before | wc -l
 2026647
[root@norquay log]# grep return pf.dtrace1.log-before | wc -l
 4021723
[root@norquay log]# bc
2026647*2
4053294
4053294-4021723
31571

Really strange!
(In reply to hlh from comment #33)
The problem crops up again. Here is the end of the output of your dtrace script when states are no longer being expired:

[root@norquay log]# tail pf.dtrace2.log-after
3 9078 none:expired_states i 19947 maxentry 327 0
3 9077 thread:wakeup
3 9078 none:expired_states i 20274 maxentry 327 0
0 9077 thread:wakeup
0 9078 none:expired_states i 20601 maxentry 327 0
0 9077 thread:wakeup
0 9078 none:expired_states i 20928 maxentry 327 0
0 9077 thread:wakeup
0 9078 none:expired_states i 21255 maxentry 327 0

The dtrace script gives no more output. Then I stopped dtrace and saved the previous trace. I restarted a new dtrace, which output nothing until I ran:

echo "set timeout interval 5" | pfctl -mf -
echo "set timeout interval 5" | pfctl -mf -
nohup service pf restart

The first lines of the dtrace are:

[root@norquay log]# head pf.dtrace2.log-after
CPU ID FUNCTION:NAME
0 9077 thread:wakeup
0 9078 none:expired_states i 23544 maxentry 655 0
1 9077 thread:wakeup
1 9078 none:expired_states i 24199 maxentry 655 0
1 9077 thread:wakeup
1 9078 none:expired_states i 24854 maxentry 655 0
1 9077 thread:wakeup
1 9078 none:expired_states i 25509 maxentry 655 0
1 9077 thread:wakeup
I notice that when I do:

echo "set timeout interval 10" | pfctl -mf -

while pf is running without problems, no more states are created. I must run `nohup service pf restart` to restore a functional system.
I do not understand this at all. There is no reason for the purge thread to get stuck. There's also no reason for it apparently having run 7 times between freezing and when you restarted dtrace. (That's based on the purge_idx value logged by the expired_states probe: 23544 - 21255 == 327 * 7.) (Please just don't stop it at all while you're performing the workaround. There's no reason for it, and stopping it runs the risk of missing interesting information.)

It still looks like we just don't wake up from the sx_sleep(pf_purge_thread, &pf_end_lock, ...) call, but that makes no sense. The only way for that to happen is if someone were holding the pf_end_lock, but that lock is only taken when the pf module is unloaded (and by the purge thread, obviously).

Do you see other strange behaviour on the system? If there's something wrong with the sleep/lock infrastructure I'd expect to see other strange things.
(In reply to Kristof Provost from comment #37) The only thing that seems strange to me is the memory used on this system: top show: last pid: 37231; load averages: 0.19, 0.20, 0.16 up 3+17:01:05 08:04:01 73 processes: 1 running, 71 sleeping, 1 zombie CPU: 0.1% user, 0.0% nice, 0.2% system, 0.0% interrupt, 99.7% idle Mem: 69M Active, 76M Inact, 57M Laundry, 1727M Wired, 35M Free ARC: 597M Total, 70M MFU, 195M MRU, 1868K Anon, 4274K Header, 326M Other 55M Compressed, 212M Uncompressed, 3.84:1 Ratio Swap: 4096M Total, 356M Used, 3740M Free, 8% Inuse zfs-stats -M: ------------------------------------------------------------------------ ZFS Subsystem Report Mon Oct 23 08:04:34 2017 ------------------------------------------------------------------------ System Memory: 3.53% 69.41 MiB Active, 3.87% 76.07 MiB Inact 87.95% 1.69 GiB Wired, 0.00% 0 Cache 1.72% 33.71 MiB Free, 2.93% 57.48 MiB Gap Real Installed: 2.00 GiB Real Available: 98.34% 1.97 GiB Real Managed: 97.50% 1.92 GiB Logical Total: 2.00 GiB Logical Used: 94.64% 1.89 GiB Logical Free: 5.36% 109.78 MiB Kernel Memory: 147.94 MiB Data: 86.01% 127.24 MiB Text: 13.99% 20.70 MiB Kernel Memory Map: 512.00 MiB Size: 53.68% 274.86 MiB Free: 46.32% 237.14 MiB ------------------------------------------------------------------------ zfs-stats -A ------------------------------------------------------------------------ ZFS Subsystem Report Mon Oct 23 08:05:18 2017 ------------------------------------------------------------------------ ARC Summary: (HEALTHY) Memory Throttle Count: 0 ARC Misc: Deleted: 4.39m Recycle Misses: 0 Mutex Misses: 24.67m Evict Skips: 4.15b ARC Size: 233.77% 598.44 MiB Target Size: (Adaptive) 100.00% 256.00 MiB Min Size (Hard Limit): 12.50% 32.00 MiB Max Size (High Water): 8:1 256.00 MiB ARC Size Breakdown: Recently Used Cache Size: 22.91% 137.09 MiB Frequently Used Cache Size: 77.09% 461.35 MiB ARC Hash Breakdown: Elements Max: 41.80k Elements Current: 38.41% 16.06k Collisions: 448.17k Chain Max: 4 Chains: 471 ------------------------------------------------------------------------ Why ARC at 233% and 1727Mb wired?
(In reply to Kristof Provost from comment #37)
I'm really sorry: in my previous post I displayed the tail of dtrace2.log-after instead of dtrace2.log-before. I have deleted those files; shame on me! Let's wait for the next occurrence of the problem...
(In reply to hlh from comment #33)
When I run the dtrace in batch, it frequently ends with:

dtrace: script './pf.dtrace2' matched 2 probes
dtrace: buffer size lowered to 2m
dtrace: processing aborted: Abort due to systemic unresponsiveness

and so I have no valuable trace when the problem arises.

PS I'm now running:

FreeBSD norquay.restart.bel 12.0-CURRENT FreeBSD 12.0-CURRENT #0 r322941:324563M: Tue Oct 24 16:18:12 CEST 2017 root@norquay.restart.bel:/usr/obj/usr/src/sys/NORQUAY arm64

and the problem crops up more frequently.
(In reply to hlh from comment #38)
> Why ARC at 233% and 1727Mb wired?
The ARC now keeps compressed data, so you can fit more in memory. I suspect that's a result of that.
(In reply to hlh from comment #40)
> dtrace: processing aborted: Abort due to systemic unresponsiveness
You'd have to talk to a dtrace specialist about that. I'm not sure what causes that.

I still don't see this issue on my current boxes, so I don't know what I can do right now.
(In reply to Kristof Provost from comment #42)
I upgraded the Pine64 to CURRENT r324563 and now ntpd can't keep the clock stable for long - after an hour or two, the clock runs too fast. I think this clock drift is correlated with the pf problem. Is pf sensitive to a clock running too fast and being reset backward by ntpd?
(In reply to hlh from comment #43) No, that shouldn't matter, unless the clock is completely wrong (i.e. not running, or running orders of magnitude slower than expected, or running backwards, ...)
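A very crude userland check for that kind of gross clock misbehaviour (it only catches large errors, nothing subtle) is to compare a nominal sleep against the wall clock:

#!/bin/sh
# sleep a nominal 10 seconds and report how far the wall clock advanced
t0=$(date +%s)
sleep 10
t1=$(date +%s)
echo "slept 10s nominal, clock advanced $((t1 - t0))s"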
We had exactly the same problem after upgrading from 10.3 to 11.1. Since this was the only system on an older VMware ESXi, we upgraded the ESXi host to a newer version. Surprise: the problem went away... :-)
(In reply to Kristof Provost from comment #44)
I can't get the dtrace output you asked for (dtrace after the problem occurs). I upgraded to r328259 and the problem always crops up when I get a time drift (the time jumps 3 minutes into the future).
(In reply to hlh from comment #46) Are you also using VMWare esxi?
(In reply to Kristof Provost from comment #47)
No, I am running on a Pine64+ 2GB. See http://docs.freebsd.org/cgi/mid.cgi?fdf7ea24-6ada-dbb1-146a-0bf6fde11d29
At least in my particular case, I eventually tracked this down to a tunable in loader.conf:

kern.timecounter.smp_tsc_adjust=1

Since I removed that, this issue hasn't happened once, so I believe it might have been the reason.
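To check whether the same tunable is in play on another box (nothing here modifies anything; the tunable itself can only be changed via loader.conf plus a reboot, as far as I know):

# is the tunable set at boot?
grep -i timecounter /boot/loader.conf
# dump the current timecounter settings, including smp_tsc_adjust if the kernel has it
sysctl kern.timecounter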
That pretty much confirms this isn't actually a pf bug, but a time keeping issue. I'm afraid I can't help with this one.
I am seeing the same thing on:

FreeBSD gate2.funkthat.com 12.0-CURRENT FreeBSD 12.0-CURRENT #0 r336134: Mon Jul 9 19:20:11 UTC 2018 root@releng3.nyi.freebsd.org:/usr/obj/usr/src/arm64.aarch64/sys/GENERIC arm64

This is a Pine A64-LTS board. The proposed loader.conf fix does not apply to me, as I do not have that line present. This is an otherwise basic configuration without anything special set.
Is it possible to get the output from "procstat -ak" when the lockup happens? Is the Giant mutex involved here? --HPS
This is an aarch64 (or pine64) problem, not a pf issue.
Created attachment 196421 [details]
run a bunch of threads, and detect if sleep fails to wake up

This is a program I wrote to verify that sleep was working, and in my case it was not. There are patches currently attached to bug 229644 that are believed to fix the issue. I'm attaching the program so others can find it if this problem crops up again.

Noah, you didn't include the platform of your system, so I'm going to assume that either you resolved it or it's no longer an issue, unless you comment on the bug.
This should be fixed now; see bug 229644 and the commit: https://svnweb.freebsd.org/changeset/base/338272
I think I have the same issue.

uname -a
FreeBSD *** 11.2-RELEASE FreeBSD 11.2-RELEASE #0 r335510: Fri Jun 22 04:32:14 UTC 2018 root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64

sudo procstat -kk 342
  PID    TID COMM                TDNAME              KSTACK
  342 100342 pf purge            -                   mi_switch+0xe6 sleepq_timedwait+0x2f _sleep+0x21e pf_purge_thread+0x86 fork_exit+0x83 fork_trampoline+0xe

pftop shows thousands of FIN_WAIT_2:FIN_WAIT_2 states with EXP 00:00:00 and AGE > 00:24:00.

How can I fix this problem?
(In reply to skillcoder from comment #56) That's odd, we've only seen this on arm64 so far. Can you run the Dtrace script from comment #1 to confirm that your purge thread really isn't waking up?
# sudo ./pf.dtrace
dtrace: script './pf.dtrace' matched 8 probes

After waiting some minutes, ^T shows:

load: 0.13  cmd: dtrace 10177 [uwait] 159.76r 0.12u 0.06s 0% 24064k
It's default kernel on Hertzner CX10 with 1GiB RAM sysctl kern.hz kern.hz: 100 sudo pfctl -vsi No ALTQ support in kernel ALTQ related functions disabled Status: Enabled for 0 days 01:18:30 Debug: Urgent Hostid: 0x4b6112e8 Checksum: 0x6f02d61c617ad47b85858a2708d6bf50 State Table Total Rate current entries 1169 searches 1495604 317.5/s inserts 40476 8.6/s removals 10639 2.3/s Source Tracking Table current entries 0 searches 0 0.0/s inserts 0 0.0/s removals 0 0.0/s Counters match 47883 10.2/s bad-offset 0 0.0/s fragment 0 0.0/s short 0 0.0/s normalize 0 0.0/s memory 0 0.0/s bad-timestamp 0 0.0/s congestion 0 0.0/s ip-option 0 0.0/s proto-cksum 5120 1.1/s state-mismatch 31 0.0/s state-insert 0 0.0/s state-limit 0 0.0/s src-limit 0 0.0/s synproxy 0 0.0/s map-failed 0 0.0/s Limit Counters max states per rule 0 0.0/s max-src-states 0 0.0/s max-src-nodes 0 0.0/s max-src-conn 0 0.0/s max-src-conn-rate 0 0.0/s overload table insertion 0 0.0/s overload flush states 0 0.0/s sudo pfctl -ss | awk '{print $6}' | sort | uniq -c | sort -nr | head 27112 FIN_WAIT_2:FIN_WAIT_2 2237 TIME_WAIT:TIME_WAIT 224 MULTIPLE:SINGLE 188 ESTABLISHED:ESTABLISHED 177 FIN_WAIT_2:ESTABLISHED 49 SYN_SENT:ESTABLISHED 4 SYN_SENT:CLOSED 4 SINGLE:NO_TRAFFIC 3 MULTIPLE:MULTIPLE 2 0:0
After using this echo "set timeout interval 5" | sudo pfctl -mf - CPU ID FUNCTION:NAME 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry 0 65925 pf_purge_expired_states:return 0 65924 pf_purge_expired_states:entry ^C And now sudo pfctl -ss | awk '{print $6}' | sort | uniq -c | sort -nr | head 162 ESTABLISHED:ESTABLISHED 40 FIN_WAIT_2:ESTABLISHED 10 FIN_WAIT_2:FIN_WAIT_2 9 TIME_WAIT:TIME_WAIT 1 ESTABLISHED:FIN_WAIT_2
That's very strange. The symptoms are the same, but as far as I know we've never had issues with the timer on amd64 (and other users would have complained too). Can you give the latest update a try? From the looks of things you're running an unpatched 11.2.
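To confirm exactly what a RELEASE box is running and pick up errata patches, the base tools are sufficient (freebsd-update applies to RELEASE branches only):

# kernel and userland patch levels
freebsd-version -ku
# fetch and install the latest errata/security updates for the running release
freebsd-update fetch install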
(In reply to hlh from comment #48)
Since my upgrade to 12.0 r338128 and switching to a10_timer (see PR 229644), the pf problem has not reappeared for me since 2018-08-21.
I have encountered this very same problem on an ASUS RS500A-E10-RS12U (ASUS KRPA-U16 mainboard) with an AMD EPYC 7502 CPU, running 12.2-STABLE r366526. The workaround is to use the HPET timer instead of TSC-low.

Live system:
sysctl kern.timecounter.hardware="HPET"

/etc/sysctl.conf:
kern.timecounter.hardware="HPET"
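Before switching, it can be worth listing which timecounters the machine offers and their quality ratings; the kernel normally picks the highest-quality one on its own, so an override like the one above is only needed when that automatic choice misbehaves:

# available timecounters and their quality values
sysctl kern.timecounter.choice
# the one currently in use
sysctl kern.timecounter.hardware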
I actually just hit this after a 12->13 upgrade. No PF states were being aged out. Changing to HPET fixed this for me, too.

FreeBSD dax.prolixium.com 13.0-STABLE FreeBSD 13.0-STABLE #0 stable/13-n247299-aac5428f48e: Wed Sep 15 22:44:48 EDT 2021 prox@dax.prolixium.com:/usr/obj/usr/src/amd64.amd64/sys/DAX amd64

(dax:6:16:EDT)% tail -n2 /etc/sysctl.conf
# fix https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=222126
kern.timecounter.hardware=HPET

Previously this was set to TSC-low. This host is a VM running within VirtualBox, FWIW.