Bug 180873

Summary: [sctp] SCTP connection hangs on COOKIE_ECHOED
Product: Base System
Reporter: jau
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: Open
Severity: Affects Only Me
CC: markj, tuexen
Priority: Normal
Version: 9.2-BETA1
Hardware: Any
OS: Any

Description jau 2013-07-26 09:40:00 UTC
SCTP connections using IPv6 hang on the third leg of the connection setup when
trying to connect to the system itself over lo0.

netstat shows these lines for the hanging connection:

sctp46 1to1  fe80::225:90ff:f.6042  ::1.41368              ESTABLISHED
             fe80::225:90ff:f.6042
             ::1.6042
             fe80::1.6042
sctp46 1to1  ::1.41368              ::1.6042               COOKIE_ECHOED

Tcpdump (tcpdump -i lo0) trace collected during the SCTP/IPv6 connection
setup shows this...

09:35:21.999609 IP6 ::1.41140 > ::1.6042: sctp (1) [INIT] [init tag: 1501306277] [rwnd: 1864135] [OS: 10] [MIS: 2048] [init TSN: 2226957883] 
09:35:22.000383 IP6 ::1.6042 > ::1.41140: sctp (1) [INIT ACK] [init tag: 3322565489] [rwnd: 1864135] [OS: 10] [MIS: 2048] [init TSN: 1843245874] 
09:35:22.000870 IP6 ::1.41140 > ::1.6042: sctp (1) [COOKIE ECHO] 
09:35:23.000476 IP6 ::1.41140 > ::1.6042: sctp (1) [COOKIE ECHO] 
09:35:25.000820 IP6 ::1.41140 > ::1.6042: sctp (1) [COOKIE ECHO] 
09:35:29.000743 IP6 ::1.41140 > ::1.6042: sctp (1) [COOKIE ECHO] 
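
The COOKIE ECHO retransmissions in this trace are spaced roughly 1, 2 and 4 seconds apart, which is what one would expect from a retransmission timer that doubles on each expiry (RFC 4960 describes this backoff for the T1-cookie timer). A minimal sketch that derives the gaps from the timestamps above; the variable names are made up for this illustration:

```python
from datetime import datetime

# Timestamps of the four COOKIE ECHO packets, copied from the trace above.
stamps = [
    "09:35:22.000870",
    "09:35:23.000476",
    "09:35:25.000820",
    "09:35:29.000743",
]

times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]

# Gap between each packet and the next one, in seconds.
gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]

# Rounded to whole seconds, the gaps double each time.
rounded = [round(g) for g in gaps]
print(rounded)  # [1, 2, 4]
```

So the client keeps retrying on schedule and never receives a COOKIE ACK in return.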

Running the exact same connection test with IPv4 addresses shows
no problem. Everything works just fine.
With IPv6 the connect() call eventually fails with a timeout.
So it seems that the SCTP implementation does not behave the same way
when the IP layer is IPv6 as it does when the IP layer is IPv4.

The really odd thing is that this exact result happens only
when the server binds to multiple addresses and the client relies on
the implicit wildcard bind, calling connect() without an explicit bind.
If I change the test client to call sctp_bindx() explicitly for all
suitable local addresses, the connection still hangs, but netstat shows
this...

sctp46 1to1  fe80::1.44586          ::1.6042               COOKIE_WAIT
             ::1.44586              
             fe80::225:90ff:f.44586 
             fe80::225:90ff:f.44586 
sctp46 1to1  fe80::1.6042                                  LISTEN
             ::1.6042               
             fe80::225:90ff:f.6042  
             fe80::225:90ff:f.6042  

At the same time, tcpdump shows no SCTP packets on lo0 at all.
With an explicit bind on the client side and IPv4 addresses,
everything again works just fine.

So, there is some sort of disparity between how SCTP connection setup
works on top of IPv6 and IPv4.

Fix:

Not the foggiest idea yet.

How-To-Repeat:

I am not quite sure of the conditions that trigger this odd behavior.
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2013-07-26 22:45:01 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-net

Over to maintainer(s).
Comment 2 Michael.Tuexen 2013-07-27 14:23:41 UTC
Which addresses are you binding? Are the programs you use
available?
I'm pretty sure the problem is related to the address scopes
in IPv6. I don't think I have tested link-local addresses
at all.
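
To illustrate the scope distinction: the addresses in the netstat output above belong to different IPv6 scopes (::1 is loopback, fe80::1 is link-local). A minimal sketch using Python's standard ipaddress module; the helper name scope() and the three labels it returns are assumptions made for this example, not anything the SCTP stack uses:

```python
import ipaddress


def scope(addr):
    """Return a coarse scope label for an IP address string (illustrative only)."""
    ip = ipaddress.ip_address(addr)
    if ip.is_loopback:
        return "loopback"
    if ip.is_link_local:
        return "link-local"
    return "other"


# Addresses taken from the netstat output in the description.
for addr in ("::1", "fe80::1"):
    print(addr, scope(addr))
```

A source and destination with mismatched scopes, such as a link-local source towards ::1, is the kind of pair the scope rules may treat differently from ::1 to ::1.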

Best regards
Michael
Comment 3 Michael Tuexen freebsd_committer freebsd_triage 2013-07-28 21:44:21 UTC
Responsible Changed
From-To: freebsd-net->tuexen

The problem seems to be SCTP specific.
Comment 4 Eitan Adler freebsd_committer freebsd_triage 2017-12-31 08:00:30 UTC
For bugs matching the following criteria:

Status: In Progress Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped
Comment 5 Mark Johnston freebsd_committer freebsd_triage 2020-07-10 16:38:35 UTC
Is it known whether this is still a problem?  Is there a sample reproducer that could be used to verify?
Comment 6 jau 2020-07-10 18:14:46 UTC
The reason is known. When one binds the local addresses explicitly
one by one before calling connect(), the system activates all of the
local addresses right after the INIT and INIT-ACK, before COOKIE-ECHO
and COOKIE-ACK. Because at this phase the system has not seen any
traffic between the endpoints other than the two packets needed
for a successful INIT and INIT-ACK, it knows about only one single pair
of operational addresses. Activating the other potential local addresses
at this phase causes COOKIE processing to be attempted using a
different local address, which may not be routable at all to the one
known peer address.
The local addresses can be activated only after they have been used
for a successful INIT + INIT-ACK or they have been tested and proven
to be routable to at least one of the reported peer addresses. This
testing of functional address pairs works only through successful
HEARTBEAT + HEARTBEAT-ACK exchanges.
The current logic is opportunistic and wrong.
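
The rule described above can be sketched as a toy model. This is only an illustration of the proposed behavior, not the FreeBSD implementation; the class and method names (Association, on_init_ack, on_heartbeat_ack, usable_sources) are hypothetical:

```python
class Association:
    """Toy model: a local address is usable only over a confirmed path."""

    def __init__(self, bound_local):
        # All local addresses the application bound, e.g. via sctp_bindx().
        self.bound_local = list(bound_local)
        # (local, peer) pairs proven to carry traffic in both directions.
        self.confirmed = set()

    def on_init_ack(self, local, peer):
        # The pair that carried INIT + INIT-ACK is known to work.
        self.confirmed.add((local, peer))

    def on_heartbeat_ack(self, local, peer):
        # A HEARTBEAT + HEARTBEAT-ACK exchange confirms one more pair.
        self.confirmed.add((local, peer))

    def usable_sources(self, peer):
        # Only confirmed local addresses may be used towards this peer.
        return [l for l in self.bound_local if (l, peer) in self.confirmed]


assoc = Association(["fe80::1", "::1"])
assoc.on_init_ack("::1", "::1")
# COOKIE-ECHO may only use the address that carried the INIT.
print(assoc.usable_sources("::1"))  # ['::1']
```

In this model fe80::1 stays bound but inactive until a heartbeat exchange from it to the peer succeeds.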
Comment 7 Michael Tuexen freebsd_committer freebsd_triage 2020-07-10 18:50:27 UTC
(In reply to jau from comment #6)
A way to fix it is to rewrite the handling of the local addresses. At least this was the outcome of a discussion with rrs. It just requires a fair amount of changes and fixes only one specific issue. The current address handling was optimised for endpoints binding to the wildcard address.