With 'options RSS' set the 'pf:syncookie:forward' test case often fails. The behaviour is not consistent, but it is reproducible at least 50% of the time. The failure mode is that the TCP connection is established (full SYN/SYN+ACK/ACK) exchange, but the first data packet receives a RST in response. Up to that point the exchange is identical between working and failing connections. Within the same setup (i.e. without re-creating or re-configuring jails) the connection will sometimes succeed and sometimes fail. The test forwards a TCP connection, terminating on a vnet jail on the same host. Because pf's syncookie feature is enabled the code path passes through pf_send(), which calls a swi, transmitting the packet from (potentially) a different CPU. This appears to be due to a race condition in syncache_expand(), where we remove the sync cache entry (`TAILQ_REMOVE(&sch->sch_bucket, sc, sc_hash);`) prior to actually opening the connection through syncache_socket() (i.e. inserting the inpcb through in_pcbconnect()). If the race condition is triggered, we end up not finding the inpcb for the now open connection, instead trying to look up the connection in the syncache when the first data packet arrives, failing to find it and resetting the entire connection.
I guess you are referring to a race between a) an incoming ACK segment being the third segment of the three way hand shake b) an incoming non-SYN segment (for example the first data segment from the peer The non-SYN segment is responded with a RST, right? If that is true, I discussed such a race with glebius@ recently...
(In reply to Michael Tuexen from comment #1) Correct, yes. Here's a pcap with a few attempts, some successful, some failing: http://people.freebsd.org/~kp/265154.pcap
(In reply to Kristof Provost from comment #2) It is great that we have a reproducer. I added Gleb, since we were discussing this in review D35730. My fix was wrong as indicated by Gleb, but I think you found a way to hit the race condition.
There are two more problems around this place. 1) There is a problem I recently introduced. If syncache_socket() fails to do PCB insertion, it will call soabort(). This soabort() will release last reference (that belongs to listen queue) and will go with sofree() of a socket that is still on listening queue. Note that all other non-TCP consumers of sonewconn() do not have this problem, as they just set so_error instead of call to soabort(). 2) The 6f3caa6d8159 very likely needs to be reverted. With modern synchronization in the network stack (epoch + SMR) it is very likely not an optimization. I think all 3 problems can be nailed with a single patch, that will introduce pr_newconn() to be executed instead of pr_attach() for sonewconn(). It will have slightly different semantic than pr_attach(). I'm going to work on that in August, as today I'm leaving for a vacation. I think all listed problems aren't critical and can be left as as for a few weeks.
(In reply to Gleb Smirnoff from comment #4) > I'm going to work on that in August, as today I'm leaving for a vacation. I think all listed problems aren't critical and can be left as as for a few weeks. Enjoy your vacation. This issue is not urgent for me. I looked at it because there's reports of a NAT issue in pf when RSS is enabled and I was hoping that the failing test case was related. It's not, so my interest in this problem is somewhat limited. Also, if I understand what's happening correctly I'd be very surprised that it can be triggered for connections that actually leave the machine (i.e. are not between jails on the same machine), which also makes it less urgent.
(In reply to Gleb Smirnoff from comment #4) Enjoy your vacation and thanks for working on a patch.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=d88eb4654f372d0451139a1dbf525a8f2cad1cf8 commit d88eb4654f372d0451139a1dbf525a8f2cad1cf8 Author: Gleb Smirnoff <glebius@FreeBSD.org> AuthorDate: 2022-08-10 14:32:37 +0000 Commit: Gleb Smirnoff <glebius@FreeBSD.org> CommitDate: 2022-08-10 14:32:37 +0000 tcp: address a wire level race with 2 ACKs at the end of TCP handshake Imagine we are in SYN-RCVD state and two ACKs arrive at the same time, both valid, e.g. coming from the same host and with valid sequence. First packet would locate the listening socket in the inpcb database, write-lock it and start expanding the syncache entry into a socket. Meanwhile second packet would wait on the write lock of the listening socket. First packet will create a new ESTABLISHED socket, free the syncache entry and unlock the listening socket. Second packet would call into syncache_expand(), but this time it will fail as there is no syncache entry. Second packet would generate RST, effectively resetting the remote connection. It seems to me, that it is impossible to solve this problem with just rearranging locks, as the race happens at a wire level. To solve the problem, for an ACK packet arrived on a listening socket, that failed syncache lookup, perform a second non-wildcard lookup right away. That lookup may find the new born socket. Otherwise, we indeed send RST. Tested by: kp Reviewed by: tuexen, rrs PR: 265154 Differential revision: https://reviews.freebsd.org/D36066 sys/netinet/tcp_input.c | 45 +++++++++++++++++++++++++++++++-------------- 1 file changed, 31 insertions(+), 14 deletions(-)
Kristof, if you want to merge it to a stable, please re-open the bug and grab ownership & responsibility for that. Thanks a lot for all the testing!
(In reply to Gleb Smirnoff from comment #8) I'm not immediately interested in that. pfSense isn't (as far as I know) actually affected by this, and we're moving to main anyway.