Bug 174087 - [tcp] Problems with ephemeral port selection
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: Unspecified
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: Mike Karels
URL: https://reviews.freebsd.org/D24781
Keywords:
Depends on:
Blocks:
 
Reported: 2012-12-03 15:40 UTC by Keith Arner
Modified: 2021-03-03 00:04 UTC

See Also:


Attachments
mytmp (10.01 KB, application/octet-stream)
2012-12-12 15:29 UTC, karner

Description Keith Arner 2012-12-03 15:40:00 UTC
Date:      Fri, 30 Nov 2012 09:09:08 -0500
From:      Keith Arner <vornum@gmail.com>
To:        freebsd-net@freebsd.org
Subject:   Problems with ephemeral port selection
Message-ID:  <CAEo_tUH9LPzPFP-O=317rYEQ3nT66b4biQshV_8=L8hReO_BLg@mail.gmail.com>

I've noticed some issues with ephemeral port number selection from
tcp_connect(), which limit the number of concurrent, outgoing connections
that can be established (connect(), rather than accept()).  Sifting through
the source code, I believe the issues stem from two problems in the
tcp_connect() code path.  Specifically:

 1) The wrong function gets called to determine if a given ephemeral
    port number is currently usable.
 2) The ephemeral port number gets selected without considering the
    foreign addr/port.

Curiously, the effect of #1 mostly cancels the effect of #2, such that
the common calling convention gives you a correct result so long as you
only have a small number of outgoing connections.  However, once you get to
a large number of outgoing connections, things start to break down.  (I'll
define large and small later.)

As a side note, I have been working with FreeBSD 7.2.  The implementations
of several of the relevant functions have been refactored somewhere between
7.2-RELEASE and 9-STABLE, but the core problems in the logic seem to be
the same between versions.

For problem #1, the code path that selects the ephemeral port number is:
 tcp_connect() ->
   in_pcbbind() ->
     in_pcbbind_setup() ->
       in_pcb_lport() [not in FreeBSD 7.2] ->
         in_pcblookup_local()

There is a loop in in_pcb_lport() [or directly in in_pcbbind_setup() in
earlier releases] that considers candidate ephemeral port numbers and
calls in_pcblookup_local() to determine if a given candidate is suitable.
The default behaviour (if the caller has not set either SO_REUSEADDR or
SO_REUSEPORT) is to pick a local port number that is not in use by
*any* local TCP socket.

So long as the number of concurrent, outgoing connections is less than the
range configured by `sysctl net.inet.ip.portrange.*`, selecting a totally
unique ephemeral port number works OK.  However, you cannot exceed that
limit, even if each outgoing connection has a unique faddr/fport.  This
does not limit the number of connections that can be accept()'ed, only the
number of connections that can be connect()'ed.
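
As a rough illustration of that limit, the size of the pool is just the
inclusive width of the configured range.  The helper name
ephemeral_port_count() is made up for this sketch, and the bounds used in
the note below are the defaults quoted later in this report; the values on
a given host should be read from sysctl rather than assumed.

```c
#include <assert.h>

/* Hypothetical helper: number of distinct local ports in the inclusive
 * range [first, last], i.e. the most connect()ed sockets that can exist
 * at once when every one of them needs a port no other socket is using. */
int ephemeral_port_count(int first, int last)
{
    assert(first >= 0 && last <= 65535);
    return (last >= first) ? (last - first + 1) : 0;
}
```

With first = 49152 and last = 65535 this gives 16384 ports, so the limit
described above is on the order of sixteen thousand outgoing connections.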

In this particular path, I think the code should call in_pcblookup_hash(),
rather than in_pcblookup_local().  The criteria in in_pcblookup_hash() only
match if the full 5-tuple matches, rather than just the local port number.
The complication, of course, comes from the fact that in_pcbbind() is
called from both bind() and for the implicit bind that happens for a
connect().  The matching criteria in in_pcblookup_local() make sense for
the former but not quite for the latter.
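
The difference between the two kinds of match can be sketched in userland
as follows.  This is a toy model under stated assumptions, not the kernel
code: struct toy_conn and both function names are invented here, and the
real in_pcblookup_local() and in_pcblookup_hash() also deal with wildcards,
locking and hashing that are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a TCP pcb: only the four addressing fields. */
struct toy_conn {
    uint32_t laddr;
    uint16_t lport;
    uint32_t faddr;
    uint16_t fport;
};

/* Local-port-only match: "is any socket at all using this local port?"
 * This is the question an explicit bind() cares about. */
bool toy_port_in_use(const struct toy_conn *tab, size_t n, uint16_t lport)
{
    assert(tab != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (tab[i].lport == lport)
            return true;
    return false;
}

/* Full-tuple match: "does this exact connection already exist?"
 * This is the question an outgoing connect() needs answered. */
bool toy_tuple_in_use(const struct toy_conn *tab, size_t n,
                      const struct toy_conn *c)
{
    assert(c != NULL && (tab != NULL || n == 0));
    for (size_t i = 0; i < n; i++)
        if (tab[i].laddr == c->laddr && tab[i].lport == c->lport &&
            tab[i].faddr == c->faddr && tab[i].fport == c->fport)
            return true;
    return false;
}
```

A local port that toy_port_in_use() reports as taken can still be free
under toy_tuple_in_use() whenever the foreign address or port differs.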

I mentioned that the above is the default behaviour you get when you don't
specify SO_REUSEADDR or SO_REUSEPORT.  Setting SO_REUSEADDR
before calling connect() has some surprising consequences (surprising in the
sense that I don't believe SO_REUSEADDR is supposed to have any effect
on connect()).  In this case, when in_pcblookup_local() is called, wild_okay
is set to false.  This changes the matching criteria to (in effect) allow
tcp_connect() to use the full 5-tuple space.  However, this brings us to the
second problem.

Problem #2 is that the ephemeral port number is chosen before the
fport/faddr gets set on the pcb; that is tcp_connect() calls in_pcbbind() to
select the ephemeral port number, *then* calls in_pcbconnect_setup() to
populate the fport/faddr.  With SO_REUSEADDR, in_pcbbind() can select
an in-use local port.  If the local port is used by a socket with a different
laddr/fport/faddr, all is good.  However, if the local port selection
results in a full conflict, it will get rejected by the call to
in_pcblookup_hash() inside
in_pcbconnect_setup().  This happens *after* the loop inside
in_pcbbind(), so the call to tcp_connect() fails with EADDRINUSE.  Thus,
with SO_REUSEADDR, connect() can fail with EADDRINUSE long before
the ephemeral port space has been exhausted.  The application could re-try
the call to connect() and likely succeed, as a new local port would be
selected.
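
An application-level workaround along those lines might look like the
sketch below.  Both function names are hypothetical, and opening a fresh
socket for every attempt is an assumption made here to keep the retry
simple; the report does not say whether retrying on the same socket would
also pick a new port.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: call attempt(arg) until it succeeds, fails with an
 * error other than EADDRINUSE, or max_tries attempts have been made.
 * attempt() must return >= 0 on success, or -1 with errno set. */
int retry_on_eaddrinuse(int (*attempt)(void *), void *arg, int max_tries)
{
    assert(attempt != NULL && max_tries > 0);
    int ret = -1;
    for (int i = 0; i < max_tries; i++) {
        ret = attempt(arg);
        if (ret >= 0 || errno != EADDRINUSE)
            break;
    }
    return ret;
}

/* One attempt: new socket, SO_REUSEADDR, connect().  arg points to the
 * destination sockaddr_in.  Returns the connected fd, or -1 with errno
 * preserved from the call that failed. */
int connect_once(void *arg)
{
    const struct sockaddr_in *dst = arg;
    int one = 1;
    int s = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (s < 0)
        return -1;
    if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) == 0 &&
        connect(s, (const struct sockaddr *)dst, sizeof(*dst)) == 0)
        return s;
    int saved = errno;   /* close() may overwrite errno */
    close(s);
    errno = saved;
    return -1;
}
```

A caller would fill in a struct sockaddr_in for the destination and use
retry_on_eaddrinuse(connect_once, &dst, 5); the retry count is arbitrary.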

Overall, this behaviour hinders the ability to open a large number of
outbound connections:
 * If you don't specify SO_REUSEADDR, you have a fairly limited maximum
   number of outbound connections.
 * If you do specify SO_REUSEADDR, you are able to open a much larger
   number of outbound connections, but must retry on EADDRINUSE.

I believe that the logic under tcp_connect() should be modified to:

 - behave uniformly whether or not SO_REUSEADDR has been set
 - allow outgoing connection requests to re-use a local port number, so
   long as the remaining elements of the tuple (laddr, fport, faddr) are
   unique


==========
Follow-up from the freebsd-net mailing list:

Date:      Sat, 01 Dec 2012 11:31:31 -0300
From:      Fernando Gont <fernando@gont.com.ar>
To:        Keith Arner <vornum@gmail.com>
Cc:        freebsd-net@freebsd.org
Subject:   Re: Problems with ephemeral port selection
Message-ID:  <50BA14C3.4070601@gont.com.ar>
In-Reply-To: <CAEo_tUH9LPzPFP-O=317rYEQ3nT66b4biQshV_8=L8hReO_BLg@mail.gmail.com>
References:  <CAEo_tUH9LPzPFP-O=317rYEQ3nT66b4biQshV_8=L8hReO_BLg@mail.gmail.com>


Hi, Keith,

On 11/30/2012 11:09 AM, Keith Arner wrote:
>
>  - behave uniformly whether or not SO_REUSEADDR has been set
>  - allow outgoing connection requests to re-use a local port number, so
>    long as the remaining elements of the tuple (laddr, fport, faddr) are
>    unique

Please take a look at the discussion on how to "steal" incoming
connections in Section 3.1 of RFC 6056.

Cheers,
-- 
Fernando Gont
e-mail: fernando@gont.com.ar || fgont@si6networks.com
PGP Fingerprint: 7809 84F5 322E 45C7 F1C9 3945 96EE A9EF D076 FFF1

How-To-Repeat: connect() a large number of sockets, specifying SO_REUSEADDR before
calling connect().  Note that the call to connect() fails with
EADDRINUSE long before we run into any resource exhaustion.

Then connect() a large number of sockets, without specifying
SO_REUSEADDR (while all the previous sockets are still open).  Note
that connect() then fails with EADDRNOTAVAIL;  this occurs as soon
as the total number of outgoing connections equals the ephemeral
port range.



#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <arpa/inet.h>

int last_child = -1;

#define complain(exit_val)                      \
    {                                           \
        return(exit_val);                       \
    }

int SockOpt(int s, int level, int opt)
{
    int opt_val = 1;
    int ret = setsockopt(s, level, opt, &opt_val, sizeof(opt_val));
    if (ret) {
        perror("Could not setsockopt() on socket");
        complain(-1);
    }
    return 0;
}


int open_server(int port)
{
    int ret;
    struct sockaddr_in sin = { 0 };  /* zero sin_zero and any padding */

    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(port);


    int server = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (server < 0) {
        perror("Could not open server socket");
        complain(-1);
    }

    SockOpt(server, SOL_SOCKET, SO_REUSEADDR);

    ret = bind(server, (struct sockaddr *)&sin, sizeof(sin));
    if (ret) {
        perror("Could not bind() server socket");
        complain(-1);
    }

    ret = listen(server, 5);
    if (ret) {
        perror("Could not listen() server socket");
        complain(-1);
    }

    return server;
}

int cycle_client(int server, int iteration, int port, int reuse)
{
    int ret;
    struct sockaddr_in sin = { 0 };  /* zero sin_zero and any padding */

    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sin.sin_port = htons(port);

    int client = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (client < 0) {
        fprintf(stderr, "Iteration %d, errno %d: ", iteration, errno);
        perror("Could not open client socket");
        complain(-1);
    }

    if (reuse) {
        SockOpt(client, SOL_SOCKET, SO_REUSEADDR);
    }

    ret = connect(client, (struct sockaddr *)&sin, sizeof(sin));
    if (ret) {
        fprintf(stderr, "Iteration %d, errno %d: ", iteration, errno);
        perror("Could not connect() client socket");
        complain(-1);
    }

    socklen_t len = sizeof(sin);  /* accept() reads this as the buffer size */
    int child = accept(server, (struct sockaddr *)&sin, &len);
    if (child < 0) {
        fprintf(stderr, "Iteration %d, errno %d: ", iteration, errno);
        perror("Could not accept() child socket");
        complain(-1);
    }

    /* Why are we not closing the sockets?
     *
     * The point of this program is to illustrate the behaviour of the
     *  network stack when we open (or, rather connect()) a large number of
     *  outgoing sockets.  Thus, we want the sockets to linger around, to
     *  consume ephemeral port numbers.  Note that we could get largely
     *  similar behaviour by closing the sockets (if we close the client
     *  socket first), as the pcbs would linger in the TIME_WAIT state,
     *  consuming ephemeral port numbers.
     *
     * Note that because TIME_WAIT connections count against us, the
     *  behaviour being illustrated does not rely on a large number of
     *  concurrent connections, just a large number of outgoing connections
     *  established over a short time period.  But it is easier to understand
     *  the operation of this program if we leave the sockets open.
     */
    /*
    ret = close(client);
    if (ret) {
        fprintf(stderr, "Iteration %d, errno %d: ", iteration, errno);
        perror("Could not close() client");
        complain(-1);
    }
    */

    /*
    if (last_child) {
        ret = close(child);
        if (ret) {
            fprintf(stderr, "Iteration %d, errno %d: ", iteration, errno);
            perror("Could not close() child");
            complain(-1);
        }
    }
    */

    last_child = child;

    return 0;
}

/* Main loop to illustrate ephemeral port number behaviour.*/
int main(int argc, char **argv)
{
    /* num_iterations: How many sockets do we want to try to open per remote
     *  port number?  Should be set higher than the number of unique
     *  ephemeral port numbers that the stack can choose from.  With the
     *  default FreeBSD settings, that works out to:
     *
     *  net.inet.ip.portrange.last: 65535
     *  net.inet.ip.portrange.first: 49152
     *
     *  65535 - 49152 + 1 = 16384 usable port numbers
     */
    int num_iterations = 20 * 1000;

    /* num_ports: How many distinct remote ports do we want to connect to? */
    int num_ports = 2;

    /* port: base, remote port number to connect to */
    int port = 12345;

    /* reuse: Should we set SO_REUSEADDR before calling connect()?
     *  Note that we alternate this value for each remote port, to
     *  illustrate the differences in behaviour between setting it or not. */
    int reuse = 1;

    int port_loop;

    for (port_loop=0; port_loop<num_ports; port_loop++) {
        /* Set up a listening socket on the next remote port number. */
        int server = open_server(port);

        int i=0;
        for(; i<num_iterations; i++) {
            /* Open a bunch of sockets; and bail out on the first failure. */
            if (cycle_client(server, i, port, reuse)) {
                break;
            }
        }
        /* How many connections did we manage to establish on this port
         *  number (and with this "reuse" setting)?  If all is working,
         *  we ought to be able to establish as many connections as there
         *  are ephemeral ports, and we ought to be able to do so for each
         *  remote port number (barring memory exhaustion problems). */
        fprintf(stderr, "port %d; reuse %d; opened %d\n",
               port, reuse, i);

        /* Advance to the next remote port, and toggle whether we set
         *  SO_REUSEADDR. */
        port++;
        reuse = !reuse;
    }
    return 0;
}
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2012-12-09 17:56:08 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-net

Over to maintainer(s).
Comment 2 Andre Oppermann freebsd_committer freebsd_triage 2012-12-11 22:01:32 UTC
Responsible Changed
From-To: freebsd-net->andre

Looking into it.
Comment 3 karner 2012-12-12 15:29:09 UTC
Here is a preliminary patch that I've developed against the
7.2 release.

The majority of this diff is refactoring the port selection
loop out of in_pcbbind_setup() and into in_pcb_lport().  This
portion of the patch I largely lifted from the 9.0 release.
This version of in_pcb_lport() does not have the #ifdefs
for INET or INET6 as I did not want to get too distracted
by those details on a first pass.

The meat of the change can be summarized as follows:

 1) tcp_connect() no longer calls in_pcbbind() explicitly;
    rather local port allocation is deferred to be done
    from in_pcbconnect_setup().
 2) Within in_pcbconnect_setup() the call to in_pcbbind_setup()
    is changed to in_pcb_lport().  I could have left the call
    to in_pcbbind_setup(), but
    A) doing so would have meant a messy change to the function
       signature of in_pcbbind_setup(), and
    B) this particular call to in_pcbbind_setup() was being
       done *only* for local port allocation, so the bulk of
       the function body was being skipped anyway
 3) in_pcb_lport() is augmented to take the fport/faddr to
    check for uniqueness, and also an "ephemeral" argument
    to determine whether to take fport/faddr into account.
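
The idea behind item 3 can be sketched as a userland toy.  This is not the
attached patch: struct toy_conn, toy_conflicts() and toy_pick_lport() are
invented for illustration, the scan is linear where the kernel chooses its
starting candidate differently, and, like the patch, it makes no attempt
to avoid listening sockets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a TCP pcb: only the four addressing fields. */
struct toy_conn {
    uint32_t laddr;
    uint16_t lport;
    uint32_t faddr;
    uint16_t fport;
};

/* ephemeral == true:  conflict only when the whole tuple already exists.
 * ephemeral == false: any existing use of the local port is a conflict. */
static bool toy_conflicts(const struct toy_conn *tab, size_t n,
                          const struct toy_conn *want, bool ephemeral)
{
    for (size_t i = 0; i < n; i++) {
        if (tab[i].lport != want->lport)
            continue;
        if (!ephemeral)
            return true;
        if (tab[i].laddr == want->laddr && tab[i].faddr == want->faddr &&
            tab[i].fport == want->fport)
            return true;
    }
    return false;
}

/* Returns the first usable local port in [first, last], or 0 if none. */
uint16_t toy_pick_lport(const struct toy_conn *tab, size_t n,
                        uint32_t laddr, uint32_t faddr, uint16_t fport,
                        uint16_t first, uint16_t last, bool ephemeral)
{
    assert(first != 0 && first <= last);
    /* 32-bit counter so that last == 65535 does not wrap around. */
    for (uint32_t p = first; p <= last; p++) {
        struct toy_conn want = { laddr, (uint16_t)p, faddr, fport };
        if (!toy_conflicts(tab, n, &want, ephemeral))
            return (uint16_t)p;
    }
    return 0;
}
```

Because the foreign address and port are known before the loop runs, a
full-tuple conflict just moves the search on to the next candidate instead
of surfacing later as EADDRINUSE.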

With this change, I'm able to allocate the full available
ephemeral port range both with and without SO_REUSEADDR,
and without needing to retry on EADDRINUSE.  I can now open
sockets until I exhaust memory/buffer space, rather than
running into ephemeral port allocation problems.

One important thing to note about this change is that it
does *not* take into account the presence of sockets
listening on a port number in the ephemeral port space
(see the advice about port stealing in section 3.1 of
RFC 6056).  Prior to this patch, the logic would avoid
listening sockets as it would avoid any re-use of a
local port number.

Keith
Comment 4 Hiren Panchasara freebsd_committer freebsd_triage 2016-12-22 20:01:44 UTC
Assigning back to the pool as this seems like a valid problem from a cursory glance.
Comment 5 Eitan Adler freebsd_committer freebsd_triage 2018-05-28 19:41:02 UTC
batch change:

For bugs that match the following
-  Status Is In progress 
AND
- Untouched since 2018-01-01.
AND
- Affects Base System OR Documentation

DO:

Reset to open status.


Note:
I did a quick pass but if you are getting this email it might be worthwhile to double check to see if this bug ought to be closed.
Comment 6 Richard Scheffenegger freebsd_committer freebsd_triage 2020-05-24 18:11:51 UTC
I believe D24781 / rS361228 fixes this particular issue.
Comment 7 Mark Linimon freebsd_committer freebsd_triage 2020-06-20 16:37:26 UTC
Mike believes that the following commits should fix the issue:

  https://svnweb.freebsd.org/base?view=revision&revision=361228
  https://svnweb.freebsd.org/base?view=revision&revision=361231

which were as a result of https://reviews.freebsd.org/D24781 (which Mike had seen before noticing this PR).

To submitter: I know that this is a very old bug, but, just in case, are you in a position to be able to test and verify?
Comment 8 Mark Linimon freebsd_committer freebsd_triage 2020-06-20 16:40:03 UTC
^Triage: Assign to committer resolving.
Comment 9 Keith Arner 2020-08-13 16:34:36 UTC
> To submitter: I know that this is a very old bug, but, just in case,
> are you in a position to be able to test and verify?

Sorry, no.  I no longer have access to the machines or environment
where I encountered the problem.
Comment 10 Ravi Pokala 2020-08-13 17:07:57 UTC
(In reply to Keith Arner from comment #9)

Hey Keith! Long time, no see! Hope you're doing well. :-)

This was found internally at Panasas. I have pinged one of the current engineers who might be familiar with this, and will relay their response.
Comment 11 Eugene M. Zheganin 2020-08-28 04:21:18 UTC
Stepped onto this, upgraded to 12.1-STABLE r364883 : it's now fixed. netstat -an | awk '{print $4}' | awk -F. '{print $5}' | sort | uniq -c | grep -v 1\  gives me 2 and 3's, and I was seeing only 1's on the 12.1-RELEASE.
Comment 12 Mike Karels freebsd_committer freebsd_triage 2021-03-03 00:04:22 UTC
This was fixed some time back, as confirmed in Comment 11.