Bug 265154 - tcp: syncache_expand() potential race
Summary: tcp: syncache_expand() potential race
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Only Me
Assignee: Gleb Smirnoff
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-07-11 15:51 UTC by Kristof Provost
Modified: 2022-08-10 15:34 UTC (History)
4 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Kristof Provost freebsd_committer freebsd_triage 2022-07-11 15:51:02 UTC
With 'options RSS' set the 'pf:syncookie:forward' test case often fails.
The behaviour is not consistent, but it is reproducible at least 50% of the time.

The failure mode is that the TCP connection is established (full SYN/SYN+ACK/ACK) exchange, but the first data packet receives a RST in response. Up to that point the exchange is identical between working and failing connections. Within the same setup (i.e. without re-creating or re-configuring jails) the connection will sometimes succeed and sometimes fail.

The test forwards a TCP connection, terminating on a vnet jail on the same host. Because pf's syncookie feature is enabled the code path passes through pf_send(), which calls a swi, transmitting the packet from (potentially) a different CPU.

This appears to be due to a race condition in syncache_expand(), where we remove the sync cache entry (`TAILQ_REMOVE(&sch->sch_bucket, sc, sc_hash);`) prior to actually opening the connection through syncache_socket() (i.e. inserting the inpcb through in_pcbconnect()).

If the race condition is triggered, we end up not finding the inpcb for the now open connection, instead trying to look up the connection in the syncache when the first data packet arrives, failing to find it and resetting the entire connection.
Comment 1 Michael Tuexen freebsd_committer freebsd_triage 2022-07-12 12:23:03 UTC
I guess you are referring to a race between
a) an incoming ACK segment being the third segment of the three way hand shake
b) an incoming non-SYN segment (for example the first data segment from the peer
The non-SYN segment is responded with a RST, right?
If that is true, I discussed such a race with glebius@ recently...
Comment 2 Kristof Provost freebsd_committer freebsd_triage 2022-07-12 12:46:22 UTC
(In reply to Michael Tuexen from comment #1)
Correct, yes. Here's a pcap with a few attempts, some successful, some failing: http://people.freebsd.org/~kp/265154.pcap
Comment 3 Michael Tuexen freebsd_committer freebsd_triage 2022-07-12 16:24:08 UTC
(In reply to Kristof Provost from comment #2)
It is great that we have a reproducer. I added Gleb, since we were discussing this in review D35730. My fix was wrong as indicated by Gleb, but I think you found a way to hit the race condition.
Comment 4 Gleb Smirnoff freebsd_committer freebsd_triage 2022-07-12 16:43:46 UTC
There are two more problems around this place.

1) There is a problem I recently introduced. If syncache_socket() fails to do PCB insertion, it will call soabort(). This soabort() will release last reference (that belongs to listen queue) and will go with sofree() of a socket that is still on listening queue. Note that all other non-TCP consumers of sonewconn() do not have this problem, as they just set so_error instead of call to soabort().
2) The 6f3caa6d8159 very likely needs to be reverted. With modern synchronization in the network stack (epoch + SMR) it is very likely not an optimization.

I think all 3 problems can be nailed with a single patch, that will introduce pr_newconn() to be executed instead of pr_attach() for sonewconn(). It will have slightly different semantic than pr_attach().

I'm going to work on that in August, as today I'm leaving for a vacation. I think all listed problems aren't critical and can be left as as for a few weeks.
Comment 5 Kristof Provost freebsd_committer freebsd_triage 2022-07-12 16:50:10 UTC
(In reply to Gleb Smirnoff from comment #4)
> I'm going to work on that in August, as today I'm leaving for a vacation. I think all listed problems aren't critical and can be left as as for a few weeks.

Enjoy your vacation. This issue is not urgent for me. I looked at it because there's reports of a NAT issue in pf when RSS is enabled and I was hoping that the failing test case was related. It's not, so my interest in this problem is somewhat limited.

Also, if I understand what's happening correctly I'd be very surprised that it can be triggered for connections that actually leave the machine (i.e. are not between jails on the same machine), which also makes it less urgent.
Comment 6 Michael Tuexen freebsd_committer freebsd_triage 2022-07-12 17:27:38 UTC
(In reply to Gleb Smirnoff from comment #4)
Enjoy your vacation and thanks for working on a patch.
Comment 7 commit-hook freebsd_committer freebsd_triage 2022-08-10 14:32:57 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=d88eb4654f372d0451139a1dbf525a8f2cad1cf8

commit d88eb4654f372d0451139a1dbf525a8f2cad1cf8
Author:     Gleb Smirnoff <glebius@FreeBSD.org>
AuthorDate: 2022-08-10 14:32:37 +0000
Commit:     Gleb Smirnoff <glebius@FreeBSD.org>
CommitDate: 2022-08-10 14:32:37 +0000

    tcp: address a wire level race with 2 ACKs at the end of TCP handshake

    Imagine we are in SYN-RCVD state and two ACKs arrive at the same time,
    both valid, e.g. coming from the same host and with valid sequence.

    First packet would locate the listening socket in the inpcb database,
    write-lock it and start expanding the syncache entry into a socket.
    Meanwhile second packet would wait on the write lock of the listening
    socket.  First packet will create a new ESTABLISHED socket, free the
    syncache entry and unlock the listening socket.  Second packet would
    call into syncache_expand(), but this time it will fail as there
    is no syncache entry.  Second packet would generate RST, effectively
    resetting the remote connection.

    It seems to me, that it is impossible to solve this problem with
    just rearranging locks, as the race happens at a wire level.

    To solve the problem, for an ACK packet arrived on a listening socket,
    that failed syncache lookup, perform a second non-wildcard lookup right
    away.  That lookup may find the new born socket.  Otherwise, we indeed
    send RST.

    Tested by:              kp
    Reviewed by:            tuexen, rrs
    PR:                     265154
    Differential revision:  https://reviews.freebsd.org/D36066

 sys/netinet/tcp_input.c | 45 +++++++++++++++++++++++++++++++--------------
 1 file changed, 31 insertions(+), 14 deletions(-)
Comment 8 Gleb Smirnoff freebsd_committer freebsd_triage 2022-08-10 14:44:49 UTC
Kristof, if you want to merge it to a stable, please re-open the bug and grab ownership & responsibility for that. Thanks a lot for all the testing!
Comment 9 Kristof Provost freebsd_committer freebsd_triage 2022-08-10 15:34:46 UTC
(In reply to Gleb Smirnoff from comment #8)
I'm not immediately interested in that. pfSense isn't (as far as I know) actually affected by this, and we're moving to main anyway.