Port autoselection on connect() without bind() (or with bind() with a zero
sin_port) sometimes works wrongly and returns a local port number that is
already busy, which leads to EADDRINUSE on the connection attempt. This
happens when jails are used.
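For readers unfamiliar with the two code paths mentioned above, here is a minimal userland sketch of how port autoselection is triggered. The function names and the loopback address are illustrative assumptions, not part of the report; only standard `socket` calls are used.

```python
import socket


def autoselected_port_via_bind(host="127.0.0.1"):
    """bind() with port 0 asks the kernel to choose a free local port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))            # zero port: kernel autoselects
        return s.getsockname()[1]    # the port the kernel picked


def autoselected_port_via_connect(server_addr):
    """connect() on an unbound socket performs an implicit bind first."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.connect(server_addr)       # no bind(): kernel autoselects
        return s.getsockname()[1]
```

In both cases the kernel is expected to hand back a port that is actually free; the report is about the case where it does not.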
How to fix:
src/sys/netinet/in_pcb.c, in_pcb_lport() function
calls to in_pcblookup_local() should pass NULL as the last argument, not cred.
That's because here we are not checking access rights but
avoiding port number conflicts, so all inpcbs should be taken into account.
This all applies to FreeBSD 9.x, 10.x and HEAD (possibly older versions too).
diff -ur src-svn/10.2/sys/netinet/in_pcb.c src/sys/netinet/in_pcb.c
--- src-svn/10.2/sys/netinet/in_pcb.c 2016-03-15 22:58:38.088511000 +0300
+++ src/sys/netinet/in_pcb.c 2016-05-20 14:51:43.340568000 +0300
@@ -452,7 +452,7 @@
tmpinp = in_pcblookup_local(pcbinfo, laddr,
- lport, lookupflags, cred);
+ lport, lookupflags, NULL /*cred*/);
} while (tmpinp != NULL);
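The effect of the one-argument change in the diff above can be illustrated with a toy model. This is not kernel code: `Pcb`, `lookup_local`, `pick_port` and `bind` are hypothetical names, the port search is sequential rather than randomized, and the assumption that a non-NULL cred hides other jails' entries is taken from the description in this report.

```python
class Pcb:
    """A bound socket: owning jail, local address, local port."""
    def __init__(self, jail, addr, port):
        self.jail, self.addr, self.port = jail, addr, port


def lookup_local(pcbs, addr, port, cred):
    """Toy stand-in for in_pcblookup_local(). Assumption: with cred set,
    only entries owned by that jail are visible; with cred=None, all are."""
    for pcb in pcbs:
        if pcb.addr == addr and pcb.port == port:
            if cred is None or pcb.jail == cred:
                return pcb
    return None


def pick_port(pcbs, addr, jail, first, last, cred_aware):
    """Return the first port in [first, last] the lookup reports as free."""
    cred = jail if cred_aware else None
    for port in range(first, last + 1):
        if lookup_local(pcbs, addr, port, cred) is None:
            return port
    return None


def bind(pcbs, jail, addr, port):
    """Simplification: the final conflict check is system-wide."""
    if any(p.addr == addr and p.port == port for p in pcbs):
        raise OSError("EADDRINUSE")
    pcbs.append(Pcb(jail, addr, port))


# jailA already holds port 50000 on the shared address.
pcbs = [Pcb("jailA", "10.0.0.1", 50000)]
buggy = pick_port(pcbs, "10.0.0.1", "jailB", 50000, 50010, cred_aware=True)
fixed = pick_port(pcbs, "10.0.0.1", "jailB", 50000, 50010, cred_aware=False)
print(buggy, fixed)  # 50000 50001
```

With the cred-filtered lookup, jailB is offered port 50000, which jailA already holds, so the subsequent bind fails. With cred passed as None the search skips it and returns 50001.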
Created attachment 172001
patch to fix incorrect EADDRINUSE
Although this change does not seem wrong, it seems necessary only when using the same IP address in different jails, or in a jail and the outer system. Is that the case?
Also, if the change applies to the in_pcblookup_local() call, it also applies to the in6_pcblookup_local() call above it.
(In reply to Jilles Tjoelker from comment #2)
Yes, got that error with multiple jails on the same IP.
And most likely yes, it applies to in6 branch too (uploaded patch above contains in6).
Can this be processed faster so that the patch makes it into the upcoming 10.4?
bumping as well :)
Created attachment 196816
11.2-STABLE Patch for EADDRINUSE
I've tested this patch on 11.1 and 11.2 and it solves the EADDRINUSE error we were seeing with multiple jails with the same IP address.
I have also been running jails with different IPs on the same box and didn't see any new errors at all.
Attached the updated patch for 11.2-STABLE.
If someone could merge this back it would be great!
The analysis of this bug and the patch provided seem correct to my eye. Adding some folks who have been poking around in the network stack at the moment to advise.
Stupid question: can someone write the short test case for this to reproduce the problem?
Not a script as such; however, to reproduce I create 2 jails with the same IP,
reduce the ephemeral port range to increase the probability of hitting the issue,
and run the following script in each jail and compare the logs. I'm sure there is an easy way to do this with shell, but I couldn't get the low-level error with curl.
import threading, requests
def makerequests():
    try:
        r = requests.get(<% choose a local URL %>)
    except Exception as e:
        print(e)  # assumed handler body (the original line was lost): log the error
for i in range(0,10000):
    threading.Thread(target=makerequests).start()
missed a zero on net.inet.ip.portrange.hilast: 52000
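As a possible alternative to the requests-based script, the sketch below uses only the standard `socket` module so that the low-level errno is visible directly. The function name `count_connect_errors` and the choice to keep successful sockets open (so their local ports stay busy) are assumptions for illustration, not part of the original test.

```python
import errno
import socket


def count_connect_errors(target, attempts):
    """Open `attempts` TCP connections without bind() and tally the results."""
    tally = {"ok": 0, "EADDRINUSE": 0, "other": 0}
    held = []  # keep sockets open so their local ports remain in use
    try:
        for _ in range(attempts):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            try:
                s.connect(target)    # implicit bind picks the local port
            except OSError as e:
                s.close()
                key = "EADDRINUSE" if e.errno == errno.EADDRINUSE else "other"
                tally[key] += 1
            else:
                held.append(s)
                tally["ok"] += 1
    finally:
        for s in held:
            s.close()
    return tally
```

Run in each of the two jails against the same local service, for example `count_connect_errors(("192.0.2.10", 80), 500)` with a placeholder address, a nonzero EADDRINUSE count would indicate the problem. Keep `attempts` below the process file descriptor limit, since successful sockets are held open until the end.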
What are the semantics of the last argument of in_pcblookup_local()?
trying to summarise to get the exact case right as the suggested patch looks not quite right. There are too many (corner) cases to consider.
two jails, same single IP address.
In each jail a program tries to establish a connection and has bound a local source address or not, but must not have bound a local port number.
On connect() to a local or remote address and port there may be a case that two applications in two different jails get an implicit bind to the same local port number out of which one succeeds and one fails? So one connect call succeeds and one fails?
It is not yet fully understood if the same could possibly happen between the base system and a jail, in which case it is assumed that the connect() inside the jail would be the one always failing?
I'll take the bug for now at least
(In reply to Bjoern A. Zeeb from comment #14)
> trying to summarise to get the exact case right as the suggested patch looks not quite right
I don't understand what's wrong with the patch.
> There are too many (corner) cases to consider.
All of them are covered by that single check: busy ports should be detected using the system-wide list of used ports, not the per-jail list.
> In each jail a program tries to establish a connection and has bound a local source address or not, but must not have bound a local port number.
> On connect() to a local or remote address and port there may be a case that two applications in two different jails get an implicit bind to the same local port number out of which one succeeds and one fails? So one connect call succeeds and one fails?
No. The second implicit bind itself fails (search for a "non-busy" port, find a port that is actually busy, try to bind, fail) and returns an error through the connect() that attempted it.
> It is not yet fully understood if the same could possibly happen between the base system and a jail, in which case it is assumed that the connect() inside the jail would be the one always failing?
Yes, it can, when the implicit bind happens in a jail. The already busy port can be anywhere outside that jail, so it may be in another jail or in the host system.
This bug is very easy to fix, so why not do it?
@All Please don't *solely* bump issues, unless providing additional information beyond what already exists.
- Assignee timeout (> 1 year), reset assignee
@Reporter: Can you please provide:
- /etc/rc.conf / /etc/jail.conf configuration/setup that reproduces the issue (as an attachment)
An updated patch against CURRENT/head would be handy (please obsolete existing patches when attaching the updated one)
I am reasonably sure that this bug is fixed in head and 12-STABLE. The change that fixed it (r361228) has not been backported to 11-STABLE. Can anyone verify that the bug does not exist in head or 12-STABLE?
- Assign to committer that (reportedly) resolved
- Track merges / non-merges
@Mike Do you have the reference stable/12 merge revision handy? Just add a comment with 'base r<revision>' to have it auto link
base r362446 in stable/12 should have the fix.
^Triage: Close resolved.
@Reporter, if the issue is still reproducible after updating to a FreeBSD version/revision that contains the fix, please re-open this issue with additional information.