Summary: | tcp connect() can return invalid EADDRINUSE (Eg: multiple jails with the same IP address)
---|---
Product: | Base System
Reporter: | aler
Component: | kern
Assignee: | Mike Karels <karels>
Status: | Closed FIXED
Severity: | Affects Some People
CC: | bcmills, bz, dev, emaste, jilles, karels, king.c.david, koobs, markj, net, pi, rgrimes, sbruno, tuexen
Priority: | ---
Flags: | koobs: mfc-stable12+, koobs: mfc-stable11-
Version: | CURRENT
Hardware: | Any
OS: | Any
Description
aler
2016-06-30 19:12:01 UTC
Created attachment 172001
patch to fix incorrect EADDRINUSE
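The failure mode explained in the comments below can be pictured with a toy model. This is a sketch in Python, not kernel code: `PortTable`, `implicit_bind` and the `jail_scoped_lookup` flag are hypothetical names, and the logic only mirrors the explanation given later in this thread (the search for a free port considers the jail's own ports, while the bind itself is checked against every port in use on the system).

```python
import errno

# Tiny ephemeral range, similar to the one used for reproduction in this thread.
PORT_FIRST, PORT_LAST = 10000, 10004


class PortTable:
    """Toy stand-in for the system-wide table of bound (ip, port) pairs."""

    def __init__(self):
        self.owner = {}  # (ip, port) -> name of the jail holding the port

    def in_use(self, ip, port, jail=None):
        # jail=None means a system-wide lookup; otherwise only ports held
        # by that jail count as busy.
        holder = self.owner.get((ip, port))
        if holder is None:
            return False
        return jail is None or holder == jail

    def implicit_bind(self, ip, jail, jail_scoped_lookup):
        """Return (port, None) on success or (None, errno name) on failure."""
        scope = jail if jail_scoped_lookup else None
        for port in range(PORT_FIRST, PORT_LAST + 1):
            if self.in_use(ip, port, scope):
                continue  # looks busy, try the next port
            # The bind itself is always checked against the whole table.
            if (ip, port) in self.owner:
                return None, errno.errorcode[errno.EADDRINUSE]
            self.owner[(ip, port)] = jail
            return port, None
        # Every port in the range is busy.
        return None, errno.errorcode[errno.EADDRNOTAVAIL]


table = PortTable()
for _ in range(5):
    # jail_a takes every port in the range.
    table.implicit_bind("192.0.2.1", "jail_a", jail_scoped_lookup=True)

# jail_b shares the IP: compare the jail-scoped and system-wide lookups.
print(table.implicit_bind("192.0.2.1", "jail_b", jail_scoped_lookup=True))
print(table.implicit_bind("192.0.2.1", "jail_b", jail_scoped_lookup=False))
```

In this model the jail-scoped lookup reports a port as free although another jail holds it, so the caller gets EADDRINUSE. The system-wide lookup skips busy ports and reports EADDRNOTAVAIL only when the whole range is exhausted.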
Although this change does not seem wrong, it seems necessary only when using the same IP address in different jails, or in a jail and the outer system. Is that the case? Also, if the change applies to the in_pcblookup_local() call, it also applies to the in6_pcblookup_local() call above it.

(In reply to Jilles Tjoelker from comment #2)
Yes, I got that error with multiple jails on the same IP. And most likely yes, it applies to the in6 branch too (the patch uploaded above covers in6).

Can this be processed faster so that the fix makes it into the upcoming 10.4?

Bump.

Bumping as well :)

Created attachment 196816
11.2-STABLE Patch for EADDRINUSE
I've tested this patch on 11.1 and 11.2, and it solves the EADDRINUSE error we were seeing with multiple jails sharing the same IP address. I have also been running jails with different IPs on the same box and didn't see any new errors at all. I've attached the updated patch for 11.2-STABLE; if someone could merge this back, it would be great!

The analysis of this bug and the patch provided seem correct to my eye. Adding some folks who have been poking around in the network stack at the moment to advise.

Stupid question: can someone write a short test case that reproduces the problem?

Not a script as such. To reproduce, I create 2 jails with the same IP and reduce the ephemeral port range to increase the probability of hitting the issue:

net.inet.ip.portrange.hifirst: 51000
net.inet.ip.portrange.hilast: 5200

Then I run the following script in each jail and compare the logs. I'm sure there is an easy way to do this with shell, but I couldn't get the low-level error with curl.

#!/usr/local/bin/python
import requests
import logging
import threading

logging.basicConfig(filename='/tmp/python.log',level=logging.INFO)

def makerequests():
    # Log any failure, including low-level socket errors
    try:
        r = requests.get(<% choose a local URL %>)
    except Exception as e:
        logging.info(e)

# Start many requests at once to put pressure on the ephemeral port range
for i in range(0,10000):
    t = threading.Thread(target=makerequests)
    t.start()

Missed a zero: net.inet.ip.portrange.hilast: 52000

What are the semantics of the last argument of in_pcblookup_local()?

OK, trying to summarise to get the exact case right, as the suggested patch looks not quite right. There are too many (corner) cases to consider.

Two jails, same single IP address. In each jail a program tries to establish a connection and has bound a local source address or not, but must not have bound a local port number. On connect() to a local or remote address and port, there may be a case where two applications in two different jails get an implicit bind to the same local port number, out of which one succeeds and one fails?
So one connect call succeeds and one fails? It is not yet fully understood whether the same could happen between the base system and a jail, in which case it is assumed that the connect() inside the jail would be the one always failing?

I'll take the bug for now at least.

(In reply to Bjoern A. Zeeb from comment #14)
> trying to summarise to get the exact case right as the suggested patch looks not quite right
I don't understand what's wrong with the patch.

> There are too many (corner) cases to consider.
All of them are covered by that single check: busy ports should be detected using the system-wide list of used ports, not the jail's list of used ports.

> In each jail a program tries to establish a connection and has bound a local source address or not, but must not have bound a local port number.
Yes.

> On connect() to a local or remote address and port there may be a case that two applications in two different jails get an implicit bind to the same local port number out of which one succeeds and one fails? So one connect call succeeds and one fails?
No. The second implicit bind itself fails (search for a "non-busy" port, find a port that is actually busy, try to bind, fail) and returns an error through the connect() that attempted it.

> It is not yet fully understood if the same could possibly happen between the base system and a jail, in which case it is assumed that the connect() inside the jail would be the one always failing?
Yes, it can, when the implicit bind happens in a jail. The already-busy port can be anywhere outside that jail, so it may be in another jail or in the host system.

Bump. This bug is very easy to fix, why not do it?

@All Please don't *solely* bump issues, unless providing additional information beyond what already exists.
^Triage:
- Assignee timeout (> 1 year), reset assignee

@Reporter: Can you please provide:
- /etc/rc.conf / /etc/jail.conf configuration/setup that reproduces the issue (as an attachment)

@Anyone An updated patch against CURRENT/head would be handy (please obsolete existing patches when attaching the updated one)

I am reasonably sure that this bug is fixed in head and 12-STABLE. The change that fixed it (r361228) has not been backported to 11-STABLE. Can anyone verify that the bug does not exist in head or 12-STABLE?

^Triage:
- Assign to committer that (reportedly) resolved
- Track merges / non-merges

@Mike Do you have the reference stable/12 merge revision handy? Just add a comment with 'base r<revision>' to have it auto-link.

base r362446 in stable/12 should have the fix.

^Triage: Close resolved.

@Reporter, if the issue is still reproducible after updating to a FreeBSD version/revision that contains the fix, please re-open this issue with additional information.

On https://golang.org/issue/34264 we're seeing what appears to match the symptoms of this bug on FreeBSD 12.2, but only when connecting from an IPv6 address to an IPv4 one. The failure does not reproduce very often, but it occurs frequently enough to be detected on the Go project's builders. The most current builder image exhibiting the bug is running FreeBSD 12.2-RELEASE-p6. Is that version expected to include this fix?

According to the comment at https://go-review.googlesource.com/c/go/+/369157/1#message-c6b6ac6857673b20daacd7a25019c8817f6f836f this fix was included in the 12.2 and 13.0 releases used by the Go project, so the failures on 12.2 builders observed in https://golang.org/issue/34264#issuecomment-856946712 seem to imply that the fix was incomplete: we are still seeing EADDRINUSE for connect calls connecting from an IPv6 local address to an IPv4 remote.

Created attachment 232948
patch for 12.x 13.x 14-CURRENT

> connecting from an IPv6 address to an IPv4 one.
How do you connect from IPv6 to IPv4? It gives me EINVAL (12.3) or EAFNOSUPPORT (14) when I try to call bind() or connect() with an IPv4 address on an IPv6 socket. That check is at the very beginning of tcp6_usr_connect():
> if (nam->sa_family != AF_INET6)
>         return (EAFNOSUPPORT);
> if (nam->sa_len != sizeof (*sin6))
>         return (EINVAL);
giving no chance to handle an IPv4 address with it.

Are you using jails? I found a way to still trigger EADDRINUSE from connect() from a jail in 12.3 and 14-CURRENT; maybe that's your case: bind(fd,{0.0.0.0:0}) and then connect() (that's all, for a pure IPv4 socket). Since the issue you linked contains "connect from wildcard IPv6", most likely it was really a wildcard-bound IPv4 socket (because connecting from IPv6 to IPv4 is not possible at all).

Adding a patch to fix that (it's the same patch that was posted 6 years ago, very slightly modified for FreeBSD 12+).

Created attachment 232951
test program

Test program

First, reduce the port range to hit the problem faster:
sysctl net.inet.ip.portrange.first=10000
sysctl net.inet.ip.portrange.last=10004

Compile:
cc -o test test.c

Open two screens. In the first:
./test 10 1.2.3.4 0.0.0.0

In the second:
jail / x your.real.ip.address csh
cd /your/dir
./test 10 1.2.3.4 0.0.0.0

With an unpatched system, the second program will show (EADDRINUSE):
> connect() error 48 (Address already in use)
With a patched system, it will show the proper error (EADDRNOTAVAIL):
> bind() error 49 (Can't assign requested address)
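The bind-then-connect sequence described above can be sketched in Python for readers who want to see which call reports the error. This is not the attached test program, whose source is not reproduced in this thread. The function name `open_connections` and its parameters are assumptions made for illustration, and the destination port is left to the caller.

```python
import errno
import socket


def open_connections(count, bind_ip, dest_ip, dest_port):
    """Start `count` TCP connects, each preceded by bind(bind_ip, port 0).

    Returns (sockets, errors): the sockets are kept open so their local
    ports stay busy, and errors holds "call: ERRNO" strings for attempts
    that failed.
    """
    sockets, errors = [], []
    for _ in range(count):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setblocking(False)  # do not wait for the handshake to complete
        try:
            s.bind((bind_ip, 0))  # port 0: the kernel picks the local port
        except OSError as e:
            errors.append("bind: " + errno.errorcode.get(e.errno, str(e.errno)))
            s.close()
            continue
        rc = s.connect_ex((dest_ip, dest_port))
        # EINPROGRESS is the normal result of a non-blocking connect.
        if rc not in (0, errno.EINPROGRESS):
            errors.append("connect: " + errno.errorcode.get(rc, str(rc)))
            s.close()
            continue
        sockets.append(s)
    return sockets, errors
```

With the port range reduced as shown above, one could call something like `open_connections(10, "0.0.0.0", "1.2.3.4", 80)` on the host and again inside the jail (port 80 is an arbitrary choice here). The second run's `errors` list then shows whether the failure comes from bind() or from connect().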