Date: Fri, 30 Nov 2012 09:09:08 -0500
From: Keith Arner <vornum@gmail.com>
To: freebsd-net@freebsd.org
Subject: Problems with ephemeral port selection
Message-ID: <CAEo_tUH9LPzPFP-O=317rYEQ3nT66b4biQshV_8=L8hReO_BLg@mail.gmail.com>

I've noticed some issues with ephemeral port number selection from tcp_connect(), which limit the number of concurrent, outgoing connections that can be established (connect(), rather than accept()). Sifting through the source code, I believe the issues stem from two problems in the tcp_connect() code path. Specifically:

1) The wrong function gets called to determine if a given ephemeral port number is currently usable.
2) The ephemeral port number gets selected without considering the foreign addr/port.

Curiously, the effect of #1 mostly cancels the effect of #2, such that the common calling convention gives you a correct result so long as you only have a small number of outgoing connections. However, once you get to a large number of outgoing connections, things start to break down. (I'll define large and small later.)

As a side note, I have been working with FreeBSD 7.2. The implementations of several of the relevant functions have been refactored somewhere between 7.2-RELEASE and 9-STABLE, but the core problems in the logic seem to be the same between versions.

For problem #1, the code path that selects the ephemeral port number is:

    tcp_connect() ->
      in_pcbbind() ->
        in_pcbbind_setup() ->
          in_pcb_lport() [not in FreeBSD 7.2] ->
            in_pcblookup_local()

There is a loop in in_pcb_lport() [or directly in in_pcbbind_setup() in earlier releases] that considers candidate ephemeral port numbers and calls in_pcblookup_local() to determine if a given candidate is suitable. The default behaviour (if the caller has not set either SO_REUSEADDR or SO_REUSEPORT) is to pick a local port number that is not in use by *any* local TCP socket.
So long as the number of concurrent, outgoing connections is less than the range configured by `sysctl net.inet.ip.portrange.*`, selecting a totally unique ephemeral port number works OK. However, you cannot exceed that limit, even if each outgoing connection has a unique faddr/fport. This does not limit the number of connections that can be accept()'ed, only the number of connections that can be connect()'ed.

In this particular path, I think the code should call in_pcblookup_hash(), rather than in_pcblookup_local(). The criteria in in_pcblookup_hash() only match if the full 5-tuple matches, rather than just the local port number. The complication, of course, comes from the fact that in_pcbbind() is called both from bind() and for the implicit bind that happens for a connect(). The matching criteria in in_pcblookup_local() make sense for the former but not quite for the latter.

I mentioned that the above is the default behaviour you get when you don't specify SO_REUSEADDR or SO_REUSEPORT. Setting SO_REUSEADDR before calling connect() has some surprising consequences (surprising in the sense that I don't believe SO_REUSEADDR is supposed to have any effect on connect()). In this case, when in_pcblookup_local() is called, wild_okay is set to false. This changes the matching criteria to (in effect) allow tcp_connect() to use the full 5-tuple space. However, this brings us to the second problem.

Problem #2 is that the ephemeral port number is chosen before the fport/faddr gets set on the pcb; that is, tcp_connect() calls in_pcbbind() to select the ephemeral port number, *then* calls in_pcbconnect_setup() to populate the fport/faddr. With SO_REUSEADDR, in_pcbbind() can select an in-use local port. If the local port is used by a socket with a different laddr/fport/faddr, all is good. However, if the local port selection results in a full conflict, it will get rejected by the call to in_pcblookup_hash() inside in_pcbconnect_setup().
This happens *after* the loop inside in_pcbbind(), so the call to tcp_connect() fails with EADDRINUSE. Thus, with SO_REUSEADDR, connect() can fail with EADDRINUSE long before the ephemeral port space has been exhausted. The application could re-try the call to connect() and likely succeed, as a new local port would be selected.

Overall, this behaviour hinders the ability to open a large number of outbound connections:

* If you don't specify SO_REUSEADDR, you have a fairly limited maximum number of outbound connections.
* If you do specify SO_REUSEADDR, you are able to open a much larger number of outbound connections, but must retry on EADDRINUSE.

I believe that the logic under tcp_connect() should be modified to:

- behave uniformly whether or not SO_REUSEADDR has been set
- allow outgoing connection requests to re-use a local port number, so long as the remaining elements of the tuple (laddr, fport, faddr) are unique

==========

Follow-up from the freebsd-net mailing list:

Date: Sat, 01 Dec 2012 11:31:31 -0300
From: Fernando Gont <fernando@gont.com.ar>
To: Keith Arner <vornum@gmail.com>
Cc: freebsd-net@freebsd.org
Subject: Re: Problems with ephemeral port selection
Message-ID: <50BA14C3.4070601@gont.com.ar>
In-Reply-To: <CAEo_tUH9LPzPFP-O=317rYEQ3nT66b4biQshV_8=L8hReO_BLg@mail.gmail.com>
References: <CAEo_tUH9LPzPFP-O=317rYEQ3nT66b4biQshV_8=L8hReO_BLg@mail.gmail.com>

Hi, Keith,

On 11/30/2012 11:09 AM, Keith Arner wrote:
>
> - behave uniformly whether or not SO_REUSEADDR has been set
> - allow outgoing connection requests to re-use a local port number, so
>   long as the remaining elements of the tuple (laddr, fport, faddr) are
>   unique

Please take a look at the discussion on how to "steal" incoming connections in Section 3.1 of RFC 6056.

Cheers,
--
Fernando Gont
e-mail: fernando@gont.com.ar || fgont@si6networks.com
PGP Fingerprint: 7809 84F5 322E 45C7 F1C9 3945 96EE A9EF D076 FFF1

How-To-Repeat:

connect() a large number of sockets, specifying SO_REUSEADDR before calling connect(). Note that the call to connect() fails with EADDRINUSE long before we run into any resource exhaustion. Then connect() a large number of sockets, without specifying SO_REUSEADDR (while all the previous sockets are still open). Note that connect() then fails with EADDRNOTAVAIL; this occurs as soon as the total number of outgoing connections equals the ephemeral port range.

#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <arpa/inet.h>

int last_child = -1;

#define complain(exit_val) \
    { \
        return(exit_val); \
    }

/* Enable a boolean socket option (set it to 1). */
int SockOpt(int s, int level, int opt)
{
    int opt_val = 1;
    int ret = setsockopt(s, level, opt, &opt_val, sizeof(opt_val));
    if (ret) {
        perror("Could not setsockopt() on socket");
        complain(-1);
    }
    return 0;
}

/* Open a listening socket on the given port, on all local addresses. */
int open_server(int port)
{
    int ret;
    struct sockaddr_in sin;

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(port);

    int server = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (server < 0) {
        perror("Could not open server socket");
        complain(-1);
    }

    SockOpt(server, SOL_SOCKET, SO_REUSEADDR);

    ret = bind(server, (struct sockaddr *)&sin, sizeof(sin));
    if (ret) {
        perror("Could not bind() server socket");
        complain(-1);
    }

    ret = listen(server, 5);
    if (ret) {
        perror("Could not listen() server socket");
        complain(-1);
    }

    return server;
}

/* Connect one client to the loopback server and accept it on the server. */
int cycle_client(int server, int iteration, int port, int reuse)
{
    int ret;
    struct sockaddr_in sin;

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sin.sin_port = htons(port);

    int client = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (client < 0) {
        fprintf(stderr, "Iteration %d, errno %d: ", iteration, errno);
        perror("Could not open client socket");
        complain(-1);
    }

    if (reuse) {
        SockOpt(client, SOL_SOCKET, SO_REUSEADDR);
    }

    ret = connect(client, (struct sockaddr *)&sin, sizeof(sin));
    if (ret) {
        fprintf(stderr, "Iteration %d, errno %d: ", iteration, errno);
        perror("Could not connect() client socket");
        complain(-1);
    }

    /* accept() reads the buffer size from len, so it must be initialized. */
    socklen_t len = sizeof(sin);
    int child = accept(server, (struct sockaddr *)&sin, &len);
    if (child < 0) {
        fprintf(stderr, "Iteration %d, errno %d: ", iteration, errno);
        perror("Could not accept() child socket");
        complain(-1);
    }

    /* Why are we not closing the sockets?
     *
     * The point of this program is to illustrate the behaviour of the
     * network stack when we open (or, rather connect()) a large number of
     * outgoing sockets.  Thus, we want the sockets to linger around, to
     * consume ephemeral port numbers.  Note that we could get largely
     * similar behaviour by closing the sockets (if we close the client
     * socket first), as the pcbs would linger in the TIME_WAIT state,
     * consuming ephemeral port numbers.
     *
     * Note that because TIME_WAIT connections count against us, the
     * behaviour being illustrated does not rely on a large number of
     * concurrent connections, just a large number of outgoing connections
     * established over a short time period.  But it is easier to understand
     * the operation of this program if we leave the sockets open.
     */

    /*
    ret = close(client);
    if (ret) {
        fprintf(stderr, "Iteration %d, errno %d: ", iteration, errno);
        perror("Could not close() client");
        complain(-1);
    }
    */

    /*
    if (last_child) {
        ret = close(child);
        if (ret) {
            fprintf(stderr, "Iteration %d, errno %d: ", iteration, errno);
            perror("Could not close() child");
            complain(-1);
        }
    }
    */

    last_child = child;

    return 0;
}

/* Main loop to illustrate ephemeral port number behaviour. */
int main(int argc, char **argv)
{
    /* num_iterations: How many sockets do we want to try to open per remote
     * port number?  Should be set higher than the number of unique
     * ephemeral port numbers that the stack can choose from.  With the
     * default FreeBSD settings, that works out to:
     *
     *   net.inet.ip.portrange.last: 65535
     *   net.inet.ip.portrange.first: 49152
     *
     *   65535 - 49152 = 16383
     */
    int num_iterations = 20 * 1000;

    /* num_ports: How many distinct remote ports do we want to connect to? */
    int num_ports = 2;

    /* port: base, remote port number to connect to */
    int port = 12345;

    /* reuse: Should we set SO_REUSEADDR before calling connect()?
     * Note that we alternate this value for each remote port, to
     * illustrate the differences in behaviour between setting it or not.
     */
    int reuse = 1;

    int port_loop;
    for (port_loop = 0; port_loop < num_ports; port_loop++) {
        /* Set up a listening socket on the next remote port number. */
        int server = open_server(port);

        int i = 0;
        for (; i < num_iterations; i++) {
            /* Open a bunch of sockets; and bail out on the first failure. */
            if (cycle_client(server, i, port, reuse)) {
                break;
            }
        }

        /* How many connections did we manage to establish on this port
         * number (and with this "reuse" setting)?  If all is working,
         * we ought to be able to establish as many connections as there
         * are ephemeral ports, and we ought to be able to do so for each
         * remote port number (barring memory exhaustion problems).
         */
        fprintf(stderr, "port %d; reuse %d; opened %d\n", port, reuse, i);

        /* Advance to the next remote port, and toggle whether we set
         * SO_REUSEADDR.
         */
        port++;
        reuse = !reuse;
    }

    return 0;
}
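
The report notes that, with SO_REUSEADDR set, an application can work around the early EADDRINUSE failure by retrying connect(). A minimal sketch of such a workaround follows. The name connect_with_retry() and its max_tries parameter are hypothetical and not part of the test program above. A fresh socket is created for each attempt, since POSIX leaves the state of a socket unspecified after a failed connect().

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: connect to dst with SO_REUSEADDR set, retrying
 * only when the attempt fails with EADDRINUSE.  Returns a connected
 * socket descriptor, or -1 with errno set to the last error seen.
 */
int
connect_with_retry(const struct sockaddr_in *dst, int max_tries)
{
    assert(dst != NULL && max_tries > 0);

    for (int attempt = 0; attempt < max_tries; attempt++) {
        int s = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
        if (s < 0)
            return -1;

        int one = 1;
        if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) == 0 &&
            connect(s, (const struct sockaddr *)dst, sizeof(*dst)) == 0)
            return s;

        /* Preserve the failure reason across close(). */
        int saved_errno = errno;
        close(s);
        errno = saved_errno;

        /* Any error other than EADDRINUSE is not worth retrying. */
        if (saved_errno != EADDRINUSE)
            return -1;
    }

    /* Every attempt failed with EADDRINUSE; errno still says so. */
    return -1;
}
```

In the test program, cycle_client() could call such a helper in place of its single connect() when reuse is set. This only hides the symptom; it does not change how the stack selects the port.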
Responsible Changed From-To: freebsd-bugs->freebsd-net Over to maintainer(s).
Responsible Changed From-To: freebsd-net->andre Looking into it.
Here is a preliminary patch that I've developed against the 7.2 release. The majority of this diff is refactoring the port selection loop out of in_pcbbind_setup() and into in_pcb_lport(). This portion of the patch I largely lifted from the 9.0 release. This version of in_pcb_lport() does not have the #ifdefs for INET or INET6 as I did not want to get too distracted by those details on a first pass.

The meat of the change can be summarized as follows:

1) tcp_connect() no longer calls in_pcbbind() explicitly; rather, local port allocation is deferred to be done from in_pcbconnect_setup().

2) Within in_pcbconnect_setup() the call to in_pcbbind_setup() is changed to in_pcb_lport(). I could have left the call to in_pcbbind_setup(), but
   A) doing so would have meant a messy change to the function signature of in_pcbbind_setup(), and
   B) this particular call to in_pcbbind_setup() was being done *only* for local port allocation, so the bulk of the function body was being skipped anyway.

3) in_pcb_lport() is augmented to take the fport/faddr to check for uniqueness, and also an "ephemeral" argument to determine whether to take fport/faddr into account.

With this change, I'm able to allocate the full available ephemeral port range both with and without SO_REUSEADDR, and without needing to retry on EADDRINUSE. I can now open sockets until I exhaust memory/buffer space, rather than running into ephemeral port allocation problems.

One important thing to note about this change is that it does *not* take into account the presence of sockets listening on a port number in the ephemeral port space (see the advice about port stealing in section 3.1 of RFC 6056). Prior to this patch, the logic would avoid listening sockets as it would avoid any re-use of a local port number.

Keith
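
The idea behind the change described above is to check a candidate local port against the full (laddr, lport, faddr, fport) tuple instead of the local port alone. It can be sketched in user-space C as below. This illustrates the selection rule only and is not the patch itself: struct conn, tuple_in_use() and pick_local_port() are hypothetical names, the connection table is a plain array rather than the kernel's pcb hash, candidates are scanned sequentially, and listening sockets are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a connected pcb: one TCP 4-tuple (host order). */
struct conn {
    uint32_t laddr;
    uint16_t lport;
    uint32_t faddr;
    uint16_t fport;
};

/* Return 1 if an existing connection already uses exactly this 4-tuple. */
static int
tuple_in_use(const struct conn *tbl, size_t n, const struct conn *c)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].laddr == c->laddr && tbl[i].lport == c->lport &&
            tbl[i].faddr == c->faddr && tbl[i].fport == c->fport)
            return 1;
    }
    return 0;
}

/*
 * Pick the first local port in [first, last] that keeps the 4-tuple unique.
 * Returns 0 if every port in the range conflicts (the EADDRNOTAVAIL case).
 */
uint16_t
pick_local_port(const struct conn *tbl, size_t n, uint32_t laddr,
    uint32_t faddr, uint16_t fport, uint16_t first, uint16_t last)
{
    assert(first != 0 && first <= last);

    /* A 32-bit counter avoids wrap-around when last is 65535. */
    for (uint32_t p = first; p <= last; p++) {
        struct conn cand = { laddr, (uint16_t)p, faddr, fport };
        if (!tuple_in_use(tbl, n, &cand))
            return (uint16_t)p;
    }
    return 0;
}
```

With one existing connection from local port 49152 to 127.0.0.1:12345, a second connection to the same remote endpoint is given 49153, while a connection to remote port 12346 may re-use 49152.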
Assigning back to the pool as this seems like a valid problem from a cursory glance.
batch change:

For bugs that match the following
- Status Is In progress
AND
- Untouched since 2018-01-01.
AND
- Affects Base System OR Documentation

DO:

Reset to open status.

Note:
I did a quick pass but if you are getting this email it might be worthwhile to double check to see if this bug ought to be closed.
I believe D24781 / rS361228 fixes this particular issue.
Mike believes that the following commits should fix the issue: https://svnweb.freebsd.org/base?view=revision&revision=361228 https://svnweb.freebsd.org/base?view=revision&revision=361231 which were as a result of https://reviews.freebsd.org/D24781 (which Mike had seen before noticing this PR). To submitter: I know that this is a very old bug, but, just in case, are you in a position to be able to test and verify?
^Triage: Assign to committer resolving.
> To submitter: I know that this is a very old bug, but, just in case, > are you in a position to be able to test and verify? Sorry, no. I no longer have access to the machines or environment where I encountered the problem.
(In reply to Keith Arner from comment #9) Hey Keith! Long time, no see! Hope you're doing well. :-) This was found internally at Panasas. I have pinged one of the current engineers who might be familiar with this, and will relay their response.
Stepped onto this, upgraded to 12.1-STABLE r364883: it's now fixed.

netstat -an | awk '{print $4}' | awk -F. '{print $5}' | sort | uniq -c | grep -v '1 '

gives me 2's and 3's, whereas I was seeing only 1's on 12.1-RELEASE.
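
The pipeline above tallies connections per local port and (roughly) filters out ports that appear only once, so any remaining output means local ports are being shared. The same tally can be sketched in C. count_reused_ports() is a hypothetical helper that works on local port numbers already extracted from the netstat output; it is not something from this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return how many distinct local ports appear
 * more than once in ports[0..n-1].
 */
size_t
count_reused_ports(const uint16_t *ports, size_t n)
{
    /* One counter per possible 16-bit port number. */
    unsigned int tally[65536] = { 0 };
    size_t reused = 0;

    assert(ports != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* Count a port once, at the moment its second use is seen. */
        if (++tally[ports[i]] == 2)
            reused++;
    }
    return reused;
}
```

A result of zero corresponds to seeing only 1's in the pipeline output, as on the unfixed release; a non-zero result corresponds to the 2's and 3's seen after the fix.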
This was fixed some time back, as confirmed in Comment 11.