I occasionally observe temporary failures on my public UDP ports. The problem seems to be with libalias when used for UDP port_redirect: ipfw nat 3 config log same_ports unreg_only ip <public-ip> \ redirect_port udp 192.168.xx.xx:5007 5006 \ redirect_port tcp 192.168.xx.xx:5007 5006 02420 0 0 count log proto ip4 src-port 64000-64010 02425 0 0 nat 3 proto ip4 02430 0 0 count log proto ip4 src-port 64000-64010 Then randomly # nc -u4p 64000 <public-ip> 5006 # nc -u4p 64001 <public-ip> 5006 # nc -u4p 64002 <public-ip> 5006 Feb 23 03:51:45 <security.info> edge kernel: [75643] ipfw-oper: 2420 Count UDP 91.12.117.156:64000 <public-ip>:5006 in via tun3 Feb 23 03:51:45 <security.info> edge kernel: [75643] ipfw-oper: 2430 Count UDP 91.12.117.156:64000 192.168.xx.xx:5007 in via tun3 Feb 23 03:52:08 <security.info> edge kernel: [75666] ipfw-oper: 2420 Count UDP 91.12.117.156:64001 <public-ip>:5006 in via tun3 Feb 23 03:52:08 <security.info> edge kernel: [75666] ipfw-oper: 2430 Count UDP 91.12.117.156:64001 192.168.xx.xx:5007 in via tun3 Feb 23 03:52:29 <security.info> edge kernel: [75687] ipfw-oper: 2420 Count UDP 91.12.117.156:64001 <public-ip>:5006 in via tun3 Feb 23 03:52:29 <security.info> edge kernel: [75687] ipfw-oper: 2430 Count UDP 91.12.117.156:64001 192.168.xx.xx:5007 in via tun3 Feb 23 03:52:36 <security.info> edge kernel: [75694] ipfw-oper: 2420 Count UDP 91.12.117.156:64001 51.158.21.23:5006 in via tun3 Feb 23 03:52:36 <security.info> edge kernel: [75694] ipfw-oper: 2430 Count UDP 91.12.117.156:64001 192.168.98.18:5007 in via tun3 Feb 23 03:52:56 <security.info> edge kernel: [75714] ipfw-oper: 2420 Count UDP 91.12.117.156:64002 <public-ip>:5006 in via tun3 Feb 23 03:52:56 <security.info> edge kernel: [75714] ipfw-oper: 2430 Count UDP 91.12.117.156:64002 192.168.xx.xx:5007 in via tun3 Feb 23 03:53:06 <security.info> edge kernel: [75724] ipfw-oper: 2420 Count UDP 91.12.117.156:64000 <public-ip>:5006 in via tun3 Feb 23 03:53:06 <security.info> edge kernel: [75724] ipfw-oper: 2430 Count UDP 91.12.117.156:64000 <public-ip>:5006 in via tun3 Feb 23 03:53:06 <security.info> edge kernel: [75724] ipfw-oper: 2685 Unreach 13 UDP 91.12.117.156:64000 <public-ip>:5006 in via tun3 In the beginning 64000 and 64001 get through. Now only using 64001 for some time, and then starting to use 64002, suddenly 64000 is no longer translated. (Not always happens in that exact sequence.) # ipfw nat show log nat 2: icmp=32, udp=1, tcp=40, sctp=0, pptp=0, proto=0, frag_id=0 frag_ptr=0 / tot=73 nat 3: icmp=0, udp=2, tcp=72, sctp=0, pptp=0, proto=0, frag_id=0 frag_ptr=0 / tot=74 These udp= counters are changing meanwhile, but I don't get a clue from it, and didn't find a way to log details. Platform 13.2-BETA2 (but the problem is older)
Created attachment 248673 [details] DTRACE cmdline for general amusement and bookkeping - I mean, to observe what libalias adds and removes from it's database
I managed to analyze this a bit further: With an UDP service (e.g. VPN) portforwarded behind NAT there is no notion of a connection, and apparently it can happen that libalias creates TWO distinct records for inbound and outbound traffic, with the inbound record lacking the portforward (and the received packets then dumped as undeliverable on the NATing node). The problem always resolves itself after either 12 or 14 minutes, apparently due to internal timings. Observed with dtrace (cmdline is attached): 203.0.113.1 is my public address where NAT aliases to 198.51.100.1 is the address of the connecting client from outside 192.168.1.12 is the internal address of the VPN server (where portforward goes to) -> At this time we have a packet client->server, two server->client and another client->server, all successful 18 17832 DeleteLink:entry openvpn 1708555176 icmp=5 udp=2 tcp=108 type=17 (src_addr 192.168.1.12) (dst_addr 0.0.0.0) (alias_addr 0.0.0.0) (proxy_addr 0.0.0.0) (src_port 5007) (dst_port 0) (alias_port 5006) (proxy_port 0) -> For whatever reason libalias tries to delete some UDP, but the next line shows that nothing was actually deleted. So this would apparently be the permanent portforward, there is no other codepath 18 49947 AddLink:entry openvpn 1708555180 icmp=5 udp=2 tcp=108 -> At this time we get a packet client->server, and it is the first failing (it gets processed by libalias, but NOT portforwarded (in the next second we also get packets in the other direction, so without fractional seconds logged we do not know when exactly what happened) 18 17832 DeleteLink:entry openvpn 1708555186 icmp=5 udp=2 tcp=111 type=17 (src_addr 192.168.1.12) (dst_addr 198.51.100.1) (alias_addr 0.0.0.0) (proxy_addr 0.0.0.0) (src_port 5007) (dst_port 8211) (alias_port 5006) (proxy_port 0) -> Here the portforward got deleted (but WHY? It is continuousley used, pings every 10 seconds), and udp count goes to 1. 18 49947 AddLink:entry openvpn 1708555186 icmp=5 udp=1 tcp=111 (src_addr 203.0.113.1) (dst_addr 91.62.14.110) (alias_addr 203.0.113.1) (src_port 5006) 18 49948 AddLink:return openvpn 1708555186 icmp=5 udp=2 tcp=111 type=17 (src_addr 203.0.113.1) (dst_addr 198.51.100.1) (alias_addr 203.0.113.1) (proxy_addr 0.0.0.0) (src_port 5006) (dst_port 8211) (alias_port 5006) (proxy_port 0) -> And immediately added again, but now WITHOUT the portforward!! 19 17832 DeleteLink:entry suricata 1708555186 icmp=5 udp=2 tcp=111 type=17 (src_addr 192.168.1.12) (dst_addr 0.0.0.0) (alias_addr 0.0.0.0) (proxy_addr 0.0.0.0) (src_port 5007) (dst_port 0) (alias_port 5006) (proxy_port 0) -> Apparently the housekeeping is too aggressive... 19 49947 AddLink:entry suricata 1708555186 icmp=5 udp=2 tcp=111 (src_addr 192.168.1.12) (dst_addr 198.51.100.1) (alias_addr 0.0.0.0) (src_port 5007) 19 49948 AddLink:return suricata 1708555186 icmp=5 udp=3 tcp=111 type=17 (src_addr 192.168.1.12) (dst_addr 198.51.100.1) (alias_addr 0.0.0.0) (proxy_addr 0.0.0.0) (src_port 5007) (dst_port 8211) (alias_port 5006) (proxy_port 0) -> And now we get the portforward as a *SEPARATE* record (udp went to 3!) 19 17832 DeleteLink:entry suricata 1708555186 icmp=5 udp=3 tcp=111 type=17 (src_addr 192.168.1.12) (dst_addr 0.0.0.0) (alias_addr 0.0.0.0) (proxy_addr 0.0.0.0) (src_port 5007) (dst_port 0) (alias_port 5006) (proxy_port 0) -> groundhog day again
It seems I faced same problem. In my case, it is redirect_port udp for the port 514 (syslog) that is unidirected UDP stream. And no self-healing in my case, it just stops translating udp/514 forever, until "ipfw nat 123 config ..." command re-issued.
Reissuing the "ipfw nat config ..." does not help in my case. Switching from "redirect_port udp 192.168.xx.xx:5007 5006" to "redirect_port udp 192.168.xx.xx:5007 yy.yy.yy.yy:5006" has occasionally helped, but then again not.
Trying with /sbin/natd gives similar errors. Also, trying ng_nat gives similar errors. The latter allows to handle inflow and outflow separately, so with UDP I can use entirely different nat instances for both - which allows me to confirm Eugene's finding: this issue happens on pure unidirectional inflow.
Hunting it down - the problem is a race condition: libalias has an internal representation of a flow (called a "temporary link"). These can expire and get deleted - the logic when and how this happens is unintellegible, but there is a time gap between expiry and deletion. When another packet arrives from outside during this time gap - which is unlikely in TCP, but likely with UDP (like syslog sending another message after some pause, or VPN restarting after reboot) - libalias finds the old "temporary link", detects that it is already expired, and deletes it - but does then not anymore check for a configured portforward for this flow. So the newly installed "temporary link" is without portforwarding, and will continue to exist until perchance traffic ceases and it might get expired.
(In reply to Peter Much from comment #6) > the logic when and how this happens is unintellegible In case of ipfw nat: 1) When "ipfw nat" rule processes packets in-kernel, a packet is passed to ipfw_nat_ptr() that points to ipfw_nat() function in sys/netpfil/ipfw/ip_fw_nat.c 2) ipfw_nat() can call LibAliasOut(), LibAliasOutTry() or LibAliasIn(). 2.1) LibAliasIn() calls LibAliasInLocked(). 2.2) LibAliasOut() and LibAliasOutTry() call LibAliasOutLocked() than can call LibAliasInLocked(), too. 3) So we end up in LibAliasInLocked() or LibAliasOutLocked() and both call HouseKeeping() function sometimes (not every time) calls CleanupLink(la, &lnk, 0) to remove expired links but it is not supposed to touch permanent links we have configured for port forwarding.
Created attachment 252840 [details] workaround/fix In libalias/alias_db.c:_FindLinkIn() is a search for a temporary (fully specified) flow, and on success it does return (UseLink(la, lnk)); Further down in this function is another search for a permanent (partially specified) portforward. If we return through UseLink(), this then does CleanupLink(la, &lnk, 0) and there is a check for expiry, and then DeleteLink(lnk, ...) So from there, if the link is expired, we get a NULL back, and the permanent portforward is not searched for. I fix this by delaying the CleanupLink() invocation to after the permanent link has been searched for. And since this would add another search on potentially every packet, I delay it only if the temporary link has expired.
(In reply to Eugene Grosbein from comment #7) Okay, thanks! Can You try my patch?
(In reply to Peter Much from comment #9) I'll try. I cannot reliably reproduce the problem in my setup, so it will take time.
I've managed to reproduce the problem and came up with simplier and a bit more efficient patch.
Created attachment 252849 [details] proposed fix
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=8132e959099f0c533f698d8fbc17386f9144432f commit 8132e959099f0c533f698d8fbc17386f9144432f Author: Eugene Grosbein <eugen@FreeBSD.org> AuthorDate: 2024-08-19 03:34:37 +0000 Commit: Eugene Grosbein <eugen@FreeBSD.org> CommitDate: 2024-08-19 03:34:37 +0000 libalias: fix subtle racy problem in outside-inside forwarding sys/netinet/libalias/alias_db.c has internal static function UseLink() that passes a link to CleanupLink() to verify if the link has expired. If so, UseLink() may return NULL. _FindLinkIn()'s usage of UseLink() is not quite correct. Assume there is "redirect_port udp" configured to forward incoming traffic for specific port to some internal address. Such a rule creates partially specified permanent link. After first such packet libalias creates new fully specifiled temporary LINK_UDP with default timeout 60 seconds. Also, in case of low traffic libalias may assign "timestamp" for this new temporary link way in the past because LibAliasTime is updated seldom and can keep old value for tens of seconds, and it will be used for the temporary link. It may happen that next incoming packet for redirected port passed to _FindLinkIn() results in a call to UseLink() that returns NULL due to detected expiration. Immediate return of NULL results in broken translation: either a packet is dropped (deny_incoming mode) or delivered to original destination address instead of internal one. Fix it with additional check for NULL to proceed with a search for original partially specified link. In case of UDP, it also recreates temporary fully specified link with a call to ReLink(). Practical examples are "redirect_port udp" rules for unidirectional SYSLOG protocol (port 514) or some low volume VPN encapsulated in UDP. Thanks to Peter Much for initial analysis and first version of a patch. Reported by: Peter Much <pmc@citylink.dinoex.sub.org> PR: 269770 MFC after: 1 week sys/netinet/libalias/alias_db.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-)
*** Bug 249387 has been marked as a duplicate of this bug. ***
(In reply to Eugene Grosbein from comment #11) Wait - I tried that one first, and after some time it crashed my kernel. So I concluded, that UseLink(), when deleting the link, modifies something in the linked-lists, and when later the temporary link is searched (utilizing the previousely obtained grp->partial value), something might conflict. I didn't bother to verify this assumption, or to exactly understand the ramifications of these linked-lists. Instead, I just changed the code to run the searches before actually doing the delete.
> Wait - I tried that one first, and after some time it crashed my kernel. Did you collect a crashdump? Are you sure it crashed in libalias? It could be LIBALIAS_LOCK_ASSERT() if you run custom kernel compiled with options INVARIANTS. I tested with GENERIC without INVARIANTS.
Created attachment 252939 [details] additional check
You seem to be right and "grp" need additional check after UseLink() that could remove it. Attached, take a look.
(In reply to Eugene Grosbein from comment #16) Yes and yes. One of two crashes wrote a dump (the other just went stuck). I looked at the backtrace, found it leading right into the functions we're dealing with, and deleted it in a mood of frustration. So sadly no backup. It was a custom production kernel for a very small KVM, no INVARIANTS or such. And it doesn't normally crash.
(In reply to Eugene Grosbein from comment #18) I did some digging in my git log. My first try, that did crash, had a flaw. The second try, that just got the system stuck offline (we don't know why, might be anything) should probably have worked, but it didn't bother to reserve an extra variable to return the result from UseLink(). So Your *first* patch is still cleaner than that - and that runs now on my KVM, and up to now it works.
(In reply to Peter Much from comment #20) Still, the code after my first patch may invalidate "intermediate node" (look for "free(grp)" in the Deletelink()), so if UseLink() returns NULL, the code needs second patch.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=e5b85380836378c9e321a4e6d300591e6faf622a commit e5b85380836378c9e321a4e6d300591e6faf622a Author: Eugene Grosbein <eugen@FreeBSD.org> AuthorDate: 2024-08-20 14:00:35 +0000 Commit: Eugene Grosbein <eugen@FreeBSD.org> CommitDate: 2024-08-20 14:04:13 +0000 libalias: add another check to previous change If UseLink() returns NULL, it is possible that Deletelink() has already freed "grp", so check it out carefully. PR: 269770 Reported by: Peter Much X-MFC-With: 8132e959099f0c533f698d8fbc17386f9144432f sys/netinet/libalias/alias_db.c | 3 +++ 1 file changed, 3 insertions(+)
A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=2441180265509a7efe8c20ee72585757e7bb9d1c commit 2441180265509a7efe8c20ee72585757e7bb9d1c Author: Eugene Grosbein <eugen@FreeBSD.org> AuthorDate: 2024-08-19 03:34:37 +0000 Commit: Eugene Grosbein <eugen@FreeBSD.org> CommitDate: 2024-08-25 06:31:24 +0000 libalias: fix subtle racy problem in outside-inside forwarding sys/netinet/libalias/alias_db.c has internal static function UseLink() that passes a link to CleanupLink() to verify if the link has expired. If so, UseLink() may return NULL. _FindLinkIn()'s usage of UseLink() is not quite correct. Assume there is "redirect_port udp" configured to forward incoming traffic for specific port to some internal address. Such a rule creates partially specified permanent link. After first such incoming packet libalias creates new fully specifiled temporary LINK_UDP with default timeout of 60 seconds. Also, in case of low traffic libalias may assign "timestamp" for this new temporary link way in the past because LibAliasTime is updated seldom and can keep old value for tens of seconds, and it will be used for the temporary link. It may happen that next incoming packet for redirected port passed to _FindLinkIn() results in a call to UseLink() that returns NULL due to detected expiration. Immediate return of NULL results in broken translation: either a packet is dropped (deny_incoming mode) or delivered to original destination address instead of internal one. Fix it with additional check for NULL to proceed with a search for original partially specified link. In case of UDP, it also recreates temporary fully specified link with a call to ReLink(). Practical examples are "redirect_port udp" rules for unidirectional SYSLOG protocol (port 514) or some low volume VPN encapsulated in UDP. Thanks to Peter Much for initial analysis and first version of a patch. Reported by: Peter Much <pmc@citylink.dinoex.sub.org> PR: 269770 (cherry picked from commit 8132e959099f0c533f698d8fbc17386f9144432f) (cherry picked from commit e5b85380836378c9e321a4e6d300591e6faf622a) sys/netinet/libalias/alias_db.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=f20a1805f7dace1e4fcfe25e26e552a9ec7ede74 commit f20a1805f7dace1e4fcfe25e26e552a9ec7ede74 Author: Eugene Grosbein <eugen@FreeBSD.org> AuthorDate: 2024-08-19 03:34:37 +0000 Commit: Eugene Grosbein <eugen@FreeBSD.org> CommitDate: 2024-08-25 06:42:12 +0000 libalias: fix subtle racy problem in outside-inside forwarding sys/netinet/libalias/alias_db.c has internal static function UseLink() that passes a link to CleanupLink() to verify if the link has expired. If so, UseLink() may return NULL. _FindLinkIn()'s usage of UseLink() is not quite correct. Assume there is "redirect_port udp" configured to forward incoming traffic for specific port to some internal address. Such a rule creates partially specified permanent link. After first such incoming packet libalias creates new fully specified temporary LINK_UDP with default timeout of 60 seconds. Also, in case of low traffic libalias may assign "timestamp" for this new temporary link way in the past because LibAliasTime is updated seldom and can keep old value for tens of seconds, and it will be used for the temporary link. It may happen that next incoming packet for redirected port passed to _FindLinkIn() results in a call to UseLink() that returns NULL due to detected expiration. Immediate return of NULL results in broken translation: either a packet is dropped (deny_incoming mode) or delivered to original destination address instead of internal one. Fix it with additional check for NULL to proceed with a search for original partially specified link. In case of UDP, it also recreates temporary fully specified link with a call to ReLink(). Practical examples are "redirect_port udp" rules for unidirectional SYSLOG protocol (port 514) or some low volume VPN encapsulated in UDP. Thanks to Peter Much for initial analysis and first version of a patch. Reported by: Peter Much <pmc@citylink.dinoex.sub.org> PR: 269770 (cherry picked from commit 8132e959099f0c533f698d8fbc17386f9144432f) (cherry picked from commit e5b85380836378c9e321a4e6d300591e6faf622a) sys/netinet/libalias/alias_db.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)
A commit in branch stable/12 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=41efd8eef313448f63125dd96dfa393f5492edbd commit 41efd8eef313448f63125dd96dfa393f5492edbd Author: Eugene Grosbein <eugen@FreeBSD.org> AuthorDate: 2024-08-19 03:34:37 +0000 Commit: Eugene Grosbein <eugen@FreeBSD.org> CommitDate: 2024-08-25 06:47:32 +0000 libalias: fix subtle racy problem in outside-inside forwarding sys/netinet/libalias/alias_db.c has internal static function UseLink() that passes a link to CleanupLink() to verify if the link has expired. If so, UseLink() may return NULL. _FindLinkIn()'s usage of UseLink() is not quite correct. Assume there is "redirect_port udp" configured to forward incoming traffic for specific port to some internal address. Such a rule creates partially specified permanent link. After first such incoming packet libalias creates new fully specified temporary LINK_UDP with default timeout of 60 seconds. Also, in case of low traffic libalias may assign "timestamp" for this new temporary link way in the past because LibAliasTime is updated seldom and can keep old value for tens of seconds, and it will be used for the temporary link. It may happen that next incoming packet for redirected port passed to _FindLinkIn() results in a call to UseLink() that returns NULL due to detected expiration. Immediate return of NULL results in broken translation: either a packet is dropped (deny_incoming mode) or delivered to original destination address instead of internal one. Fix it with additional check for NULL to proceed with a search for original partially specified link. In case of UDP, it also recreates temporary fully specified link with a call to ReLink(). Practical examples are "redirect_port udp" rules for unidirectional SYSLOG protocol (port 514) or some low volume VPN encapsulated in UDP. Thanks to Peter Much for initial analysis and first version of a patch. Reported by: Peter Much <pmc@citylink.dinoex.sub.org> PR: 269770 (cherry picked from commit 8132e959099f0c533f698d8fbc17386f9144432f) (cherry picked from commit e5b85380836378c9e321a4e6d300591e6faf622a) sys/netinet/libalias/alias_db.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)