Created attachment 194342 [details]
Reconstruct rt_kif in pfsync_state_import

I use FreeBSD and pf on routers and hardware loadbalancers. Routers do normal routing and have firewalls with only block or pass rules. Loadbalancers use route-to rules with tables of target hosts. On routers pfsync works just fine, while on loadbalancers it fails because states are synced without the target interface.

There are 2 ways to fix it:
1. Modify struct pfsync_state to include the target interface, but that would break compatibility.
2. Reconstruct the missing interface using the rules on the second loadbalancer.

Please find attached a patch solving the issue using the 2nd method (the sketch below illustrates the general idea).

There is still the issue of source nodes not being synced; they can probably be reconstructed in a similar fashion. I might provide a patch for that later on.

This is the 1st version of the patch. I am not totally sure of its stability and it is designed only to solve the issue in my particular case, that is for rules with the following syntax: "route-to (internal4027 <pool_154571_4>) round-robin"
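A rough, hypothetical sketch of what "reconstructing the interface from the rules" boils down to, assuming the usual pf_rule/pf_pooladdr layout from pfvar.h. The helper name is made up and the attached patch handles more than this (pool selection, address family, etc.):

	/*
	 * Hypothetical helper, not the attached patch: once the imported
	 * state has been matched to a local rule, take the route-to
	 * interface from that rule's current pool entry.
	 */
	static void
	pfsync_reconstruct_rt_kif(struct pf_state *st, struct pf_rule *r)
	{
		if (r->rt == PF_ROUTETO && r->rpool.cur != NULL)
			st->rt_kif = r->rpool.cur->kif;
	}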
I came across an issue preventing this from working correctly when rebooting hardware: pfsync is started before pf (or, in my case, before my custom service which populates the pf rules). That's a problem, because for the route-to interface to be correctly rebuilt, the pf rules must already be present. I'm unsure if changing this order is a good idea; maybe it is like this for a good reason?
While looking into the possibility of recreating src_nodes I found that the way src nodes are created is rather sketchy. For example, when a new state is created with a new src_node, first a node is searched for; if none is found, it is created and inserted, each of those operations with its own locking and unlocking of the src_node hash. Operations within pf_map_addr() operate on an unlocked src_node, which probably explains the crashes I had when flushing nodes on a heavily loaded system.

Then there is the issue that each locking and unlocking operation requires computing the hash again. Why not compute it once and store it within the node? That way unlocking could be a bit faster (a tiny sketch of what I mean follows below). Creation of a node could return it locked, as right now it needs to be re-locked for any further operations.

Then there is the issue that pf_state->rt_kif is copied from pf_rule->rpool.cur, which might not be the same as during pf_map_addr() (there is no locking inside that function, so it might be inconsistent anyway). And last but not least, it seems to me that pf_src_node->*kif is not used at all, and the src_node itself never stores information about the interface chosen for route-to targets; it is only copied to the state instead.

I will prepare a patch addressing those issues first and then work on recreating the redirection interface, which is what this issue was originally about. The proof-of-concept patch seems to generally work: I configured my firewall service to start before pfsync, and I could reboot my load balancers as I pleased and traffic was correctly forwarded.
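A tiny sketch of the "compute the hash once and keep it in the node" idea. sn_hashidx would be a hypothetical new field in struct pf_src_node, filled in once at creation time with the pf_hashsrc() result, so re-locking never has to rehash the address:

	/* Hypothetical convenience macros; sn_hashidx is a made-up field. */
	#define	PF_SRC_NODE_LOCK(sn)	\
		PF_HASHROW_LOCK(&V_pf_srchash[(sn)->sn_hashidx])
	#define	PF_SRC_NODE_UNLOCK(sn)	\
		PF_HASHROW_UNLOCK(&V_pf_srchash[(sn)->sn_hashidx])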
I found another bug: states synced during the initial bulk update are considered to come from an incompatible ruleset, even if the ruleset *is* compatible. I also must raise the question why the initial sync is an "update" and not an "insert".

--- a/sys/netpfil/pf/if_pfsync.c
+++ b/sys/netpfil/pf/if_pfsync.c
@@ -874,21 +874,21 @@ pfsync_in_upd(struct pfsync_pkt *pkt, struct mbuf *m, int offset, int count)
 				printf("pfsync_input: PFSYNC_ACT_UPD: "
 				    "invalid value\n");
 			}
 			V_pfsyncstats.pfsyncs_badval++;
 			continue;
 		}

 		st = pf_find_state_byid(sp->id, sp->creatorid);
 		if (st == NULL) {
 			/* insert the update */
-			if (pfsync_state_import(sp, 0))
+			if (pfsync_state_import(sp, pkt->flags))
 				V_pfsyncstats.pfsyncs_badstate++;
 			continue;
 		}

 		if (st->state_flags & PFSTATE_ACK) {
 			PFSYNC_LOCK(sc);
 			pfsync_undefer_state(st, 1);
 			PFSYNC_UNLOCK(sc);
 		}

This bug was fixed in OpenBSD some time ago:
https://github.com/openbsd/src/commit/ddb7828bc6708358e6c08caaf09e3524e8cab7b4
A commit references this bug:

Author: kp
Date: Sat Jun 30 12:51:08 UTC 2018
New revision: 335816

URL: https://svnweb.freebsd.org/changeset/base/335816

Log:
  pfsync: Fix state sync during initial bulk update

  States learned via pfsync from a peer with the same ruleset checksum were
  not getting assigned to rules like they should because pfsync_in_upd()
  wasn't passing the PFSYNC_SI_CKSUM flag along to pfsync_state_import.

  PR:		229092
  Submitted by:	Kajetan Staszkiewicz <vegeta tuxpowered.net>
  Obtained from:	OpenBSD
  MFC after:	1 week
  Sponsored by:	InnoGames GmbH

Changes:
  head/sys/netpfil/pf/if_pfsync.c
Thanks for that patch. I've not yet had the opportunity to look at the other patch (or the remarks) in any detail. I'll try to do so as soon as possible, but it may be some time.
A commit references this bug:

Author: kp
Date: Sat Jul 7 14:46:02 UTC 2018
New revision: 336064

URL: https://svnweb.freebsd.org/changeset/base/336064

Log:
  MFC r335816:

  pfsync: Fix state sync during initial bulk update

  States learned via pfsync from a peer with the same ruleset checksum were
  not getting assigned to rules like they should because pfsync_in_upd()
  wasn't passing the PFSYNC_SI_CKSUM flag along to pfsync_state_import.

  PR:		229092
  Submitted by:	Kajetan Staszkiewicz <vegeta tuxpowered.net>
  Obtained from:	OpenBSD
  Sponsored by:	InnoGames GmbH

Changes:
_U  stable/11/
  stable/11/sys/netpfil/pf/if_pfsync.c
Do we consider breaking pfsync protocol compatibility? If we could just modify the protocol to sync the redirection interface, there would be no need for reconstructing it and no need for an identical ruleset with identical table contents.
(In reply to Kajetan Staszkiewicz from comment #7) I'd be very very hesitant to break compatibility. A common pattern with pfsync is that one gateway is upgraded while the other takes over. That'll need to keep working. That said, it might be possible to extend the protocol by using one of the _pad fields. It'd have to work (minus newly supported/improved cases) when syncing with older code, but that might be possible.
I see only these fields free to be used:

struct pfsync_state {
	u_int8_t	__spare[2];
}

struct pfsync_state_peer {
	u_int8_t	pad[6];
}

Neither of them is enough to carry char ifname[IFNAMSIZ] (16 bytes on FreeBSD). I thought interfaces might have some increasing ID which would fit into those bytes, but I can't find such a thing. We could add such an increasing ID to pfi_kif, but that would still be an opportunistic solution, working correctly only if the two routers have identical interfaces added in the same order. That might in some situations be even harder to achieve than having an identical ruleset, as required by the patch I proposed.
Good point. I don't see an immediate straightforward way of handling this. I was going to suggest a new action type, but pfsync_input() stops handling the entire packet when it encounters an unknown action type. We'd either have to make sure the new type is always at the end of the packet, or we'd have to teach pfsync to ignore unknown action types first, and only later (once we hope everyone has upgraded) add the new one.
Created attachment 201346 [details]
Reconstruct interface route by standard fib lookup

I found another issue. Even if we can somehow reconstruct the route interface, there is still a requirement for having an identical ruleset on both routers, because it is rule->rt which makes Route-to, Duplicate-to and Reply-to targets work, and this information is never kept in the state.

The attached patch solves this by copying rule->rt into state->rt (a new field); the pfsync struct got this field too. The route interface is reconstructed by a normal lookup in the routing table in fib 0 (roughly along the lines of the sketch below).

Warning: for "no state" rules, rule->rt must still be used; I have coded that but not tested it. For a stateful ruleset, everything seems fine for the route-to target.
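Not the attachment itself, just a minimal sketch of what a fib-0 lookup of the route interface could look like, assuming the IPv4-only fib4_lookup_nh_basic()/nhop4_basic KPI from around the stable/11 era and pf's pfi_kif_find(); the helper name and error handling are simplified:

	#include <netinet/in_fib.h>

	/* Hypothetical helper: map a destination address to a pfi_kif via fib 0. */
	static struct pfi_kif *
	pfsync_rt_kif_by_fib(struct in_addr dst)
	{
		struct nhop4_basic nh;

		if (fib4_lookup_nh_basic(0, dst, 0, 0, &nh) != 0)
			return (NULL);	/* no route; leave rt_kif unset */

		/* Map the egress ifnet back to pf's interface abstraction. */
		return (pfi_kif_find(nh.nh_ifp->if_xname));
	}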
(In reply to Kajetan Staszkiewicz from comment #11) Wouldn't the pfcksum protect us from having different rules in the first place?
(In reply to Kristof Provost from comment #12)

pfcksum only detects whether the loaded rules are the same; it does not ensure the rules are the same on the 2 routers. There are a few ways to end up with different rulesets. Let me give you a little list of things I came across while trying to make pfsync work:

- Any rule using interface IP addresses in an unnamed table {} will end up being different on the 2 routers unless a named <table> {} is used.
- The same goes for SNAT rules, although I'm unsure if those are included in pfchecksum.
- If the ruleset is dynamically generated by a script, the data structures might not have explicit ordering and produce a different result on each run: for me it was Python and its dictionaries and sets.
- In a dynamic environment it can happen that the ruleset differs for short periods of time while a new configuration is being applied, as it will never be applied at exactly the same time on both routers. For me, on some loadbalancers a new configuration is applied tens of times a day.
To sum it up: I don't think it is feasible to have any functionality depend on the ruleset being identical. It is really hard to achieve and it might not be worth the effort.
(In reply to Kajetan Staszkiewicz from comment #13)

> - Any rule using interface IP addresses in an unnamed table {} will end up
> being different on the 2 routers unless a named <table> {} is used.

Ah, because pf generates a random id for the table? I'd argue that that's something the rules sync script (if there is one) should account for, but I'd be happy to take patches to make that 'random id' predictable (and consistent across hosts).

> - The same goes for SNAT rules, although I'm unsure if those are included in
> pfchecksum.

I'm not sure what you mean by SNAT rules. The pf_setup_pfsync_matching() function checksums all rules, other than the scrub rules.

> - If the ruleset is dynamically generated by a script, the data structures
> might not have explicit ordering and produce a different result on each run:
> for me it was Python and its dictionaries and sets.

I don't understand this one. It shouldn't matter how the rules are generated; the kernel will calculate a checksum. Or do you mean to say pf should compensate for bugs in synchronisation scripts?

I don't really see a way around the requirement for the ruleset to be identical on all pfsync synced hosts.
(In reply to Kristof Provost from comment #15)

>> - Any rule using interface IP addresses in an unnamed table {} will end up
>> being different on the 2 routers unless a named <table> {} is used.
>
> Ah, because pf generates a random id for the table?

I think so.

> I'd argue that that's something the rules sync script (if there is one)

I don't "sync" rules. I "generate" them from a central database and upload them to the loadbalancers. The generated files look identical, line by line (+/- the Python issue, I will comment on it later).

> should account for, but I'd

Taking that into account is exactly what was needed in my case. Consider these two rules:

1. allow in on $IFACE from { $HOST1 $HOST2 }

The table used here is unnamed, anonymous, dynamic or however it is called in the world of pf. There is no guarantee of its name, and thus even if the configuration is generated centrally, it will result in the ruleset having a different checksum on each loadbalancer.

But is there even any real table used at all? I remember something about dynamically generated table names, but what I see is expansion of the ruleset during loading into separate rules. E.g. the rule:

rdr on $if_public inet6 proto ipv6-icmp from any to $if_public -> <own_addrs_pub>

got expanded to 2 rules:

rdr on public inet6 proto ipv6-icmp from any to fe80::6a05:caff:fe0b:dd02 -> <own_addrs_pub> round-robin
rdr on public inet6 proto ipv6-icmp from any to 2a00:XXXXXX -> <own_addrs_pub> round-robin

(BTW, expansion to link-local addresses seems like a bug to me, I will report it separately.)

2. table <table> { $HOST1 $HOST2 }
   pass in on $IFACE from <table>

Here the table is named. The ruleset is now consistent between loadbalancers no matter the contents of the table.

> be happy to take patches to make that 'random id' predictable (and consistent
> across hosts).

Maybe one day, but for now I have already forced usage of named tables everywhere.

>> - The same goes for SNAT rules, although I'm unsure if those are included in
>> pfchecksum.
>
> I'm not sure what you mean by SNAT rules.

Sorry, of course I meant NAT rules in pf. I very much prefer the nftables terminology of SNAT and DNAT, it just makes way more sense.

> The pf_setup_pfsync_matching() function checksums all rules, other than the
> scrub rules.

That just adds one more type of rules that can screw up the checksum, as I expected.

>> - If the ruleset is dynamically generated by a script, the data structures
>> might not have explicit ordering and produce a different result on each run:
>> for me it was Python and its dictionaries and sets.
>
> I don't understand this one.

Data structures like sets and hashes have no explicit ordering, at least in Python. (I think I was getting consistent results with Python 2.7 but totally random ones after moving to 3.5.) Things put into them will be retrieved in some random order. One database of rules will produce a functionally identical firewall (at least as long as the rules are "quick" rules) but with the rules in a different order. Of course pf can't do anything about it and this is expected, see the next paragraph.

> It shouldn't matter how the rules are generated; the kernel will calculate a
> checksum. Or do you mean to say pf should compensate for bugs

That is not a bug.

> in synchronisation scripts?

No, it definitely should not. All I'm saying is that it is another trap I encountered while fighting with this topic, and that it is very hard to make the ruleset identical from the point of view of pf, so we should not expect identical rulesets.
> I don't really see a way around the requirement for the ruleset to be identical
> on all pfsync synced hosts.

But is there really such a requirement with the current state of pf? I think the whole discussion wandered away from the main topic. Let's get back on track. The current situation:

1. An identical pf.conf will result in a different checksum in many cases due to interface addresses, dynamic table names and/or rule expansion from unnamed tables.
2. pfsync of normal firewall states which only pass or nat traffic doesn't need an identical ruleset at all.
3. pfsync of states from route-to/dup-to/reply-to rules is *fully broken*.

Let me repeat once again: none of the *working* functionalities of pf seems to require an identical ruleset. Maaaaybeee label counters?

I want to focus on fixing issue 3. There are multiple approaches:

1. The old patch, which depends on the ruleset being identical and reconstructs the missing information from the rules.
2. The new patch, which sends part of the missing information (state->rt) over pfsync and discovers the interface to use from a normal route lookup.
3. Modify the pfsync structure to include both state->rt and state->rt_kif.

I would *love* to have 3. implemented, but for now I work with 2. because 1. was way too unreliable. How should we progress?
Right, for 3. we come back to the compatibility issue. pfsync has to remain able to run with different versions, so while we could potentially extend the protocol to include this information we *have* to make sure doing so won't break a host that doesn't understand the new fields. And vice versa: a host which doesn't include the information must be able to send state to a host which expects the extra information. That's probably possible, but it'll need some special care.
My 2nd patch stores the missing state->rt information in a currently unused part of struct pfsync_state. That should make it compatible: a router running a non-patched kernel will simply not transmit any data there when sending states and will ignore that data when receiving states from a patched router. So that part should be safe.

What looks potentially unsafe is the guessing of the target interface, although that is already badly broken today, as packets leave the router via the route matching the destination on an unpatched kernel. Is the guessing of the target interface done correctly? Can I use the fib lookup functions just like this? Is no locking needed?
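To illustrate the compatibility argument in code form (hypothetical names, not the patch itself): an unpatched peer sends zeros in the spare bytes, and since PF_NOPFROUTE is 0, a patched receiver naturally skips the reconstruction step. Picking the right destination address out of the state keys is elided here:

	/*
	 * Hypothetical import-side guard, assuming sp->rt is carried in a
	 * formerly spare byte of struct pfsync_state.  An unpatched peer
	 * sends zeros there, i.e. PF_NOPFROUTE, so nothing happens.
	 * dst4 stands for the state's IPv4 destination address and
	 * pfsync_rt_kif_by_fib() is the fib-lookup sketch from earlier.
	 */
	if (sp->rt != PF_NOPFROUTE && sp->af == AF_INET)
		st->rt_kif = pfsync_rt_kif_by_fib(dst4);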
There's a typo in the KASSERT (r_dir = PF_IN). I wonder if 'rt' can't be a flag. That'd give us more room for other extensions later.
'rt' contains values from enum { PF_NOPFROUTE, PF_FASTROUTE, PF_ROUTETO, PF_DUPTO, PF_REPLYTO }. I don't see how those could be squashed into a single flag, as they dictate different actions to be taken on the packet.
(In reply to Kajetan Staszkiewicz from comment #20) You are of course correct here. I'd like to try to write a test case for this. Do you have any suggestions on how to best reproduce (as simple a version as possible of) the problematic behaviour? vnet lets us create arbitrary numbers of pf/pfsync instances, so it should be possible to reproduce this. See /usr/src/tests/sys/netpfil/pf/pfsync.sh if you're interested in examples.