Bug 173444 - socket: IPV6_USE_MIN_MTU and TCP is broken
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 8.3-STABLE
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: Michael Tuexen
URL:
Keywords: needs-qa, patch
Depends on:
Blocks:
 
Reported: 2012-11-07 14:30 UTC by marka
Modified: 2019-01-25 15:29 UTC

See Also:
tuexen: mfc-stable11+
tuexen: mfc-stable10-


Attachments
kern173444 (1.61 KB, patch)
2013-01-25 01:48 UTC, marka
kern173444-rev2 (2.14 KB, patch)
2013-03-20 03:51 UTC, marka

Description marka 2012-11-07 14:30:01 UTC
Setting IPV6_USE_MIN_MTU to one (1) on an IPv6 TCP socket results
in fragmented IPv6 packets being sent rather than the TCP segment
size being adjusted to reflect the MTU limit on the socket.

00:56:44.177930 IP6 2001:470:1f00:820:218:f3ff:feba:9a37 > 2001:470:1f00:820:6233:4bff:fe01:7585: frag (0|1232) 5555 > 63656: Flags [.], ack 42, win 8211, options [nop,nop,TS val 2829969063 ecr 1028520077], length 1200
00:56:44.177936 IP6 2001:470:1f00:820:218:f3ff:feba:9a37 > 2001:470:1f00:820:6233:4bff:fe01:7585: frag (1232|228)
00:56:44.177953 IP6 2001:470:1f00:820:218:f3ff:feba:9a37 > 2001:470:1f00:820:6233:4bff:fe01:7585: frag (0|1232) 5555 > 63656: Flags [.], ack 42, win 8211, options [nop,nop,TS val 2829969063 ecr 1028520077], length 1200
00:56:44.177957 IP6 2001:470:1f00:820:218:f3ff:feba:9a37 > 2001:470:1f00:820:6233:4bff:fe01:7585: frag (1232|228)
00:56:44.177974 IP6 2001:470:1f00:820:218:f3ff:feba:9a37 > 2001:470:1f00:820:6233:4bff:fe01:7585: frag (0|1232) 5555 > 63656: Flags [.], ack 42, win 8211, options [nop,nop,TS val 2829969063 ecr 1028520077], length 1200
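
The fragment sizes in the trace are consistent with a 1460-byte TCP segment (20-byte header, 12 bytes of timestamp options, 1428 bytes of data) being cut at the 1280-byte IPv6 minimum MTU. The sketch below reproduces that arithmetic. The helper names are hypothetical, and the 1460-byte total is simply the sum of the two fragment lengths shown above; nothing here is taken from the kernel sources.

```c
#include <assert.h>

enum {
	EX_IPV6_MMTU     = 1280, /* IPv6 minimum link MTU */
	EX_IP6_HDR_LEN   = 40,   /* fixed IPv6 header */
	EX_IP6_FRAG_LEN  = 8,    /* IPv6 fragment extension header */
	EX_TCP_HDR_LEN   = 20,   /* TCP header without options */
	EX_TCP_TS_OPTLEN = 12    /* nop,nop,timestamp, as seen in the trace */
};

/* Payload carried by every fragment except the last; a multiple of 8. */
static int
ex_frag_payload(int mtu)
{
	assert(mtu >= EX_IPV6_MMTU);
	return ((mtu - EX_IP6_HDR_LEN - EX_IP6_FRAG_LEN) / 8 * 8);
}

/* Payload carried by the last fragment of a `total`-byte fragmentable part. */
static int
ex_last_frag_payload(int total, int mtu)
{
	int rem;

	assert(total > 0);
	rem = total % ex_frag_payload(mtu);
	return (rem != 0 ? rem : ex_frag_payload(mtu));
}

/* TCP data bytes visible in the first fragment (tcpdump's "length"). */
static int
ex_first_frag_tcp_data(int mtu)
{
	return (ex_frag_payload(mtu) - EX_TCP_HDR_LEN - EX_TCP_TS_OPTLEN);
}
```

For a 1280-byte limit and a 1460-byte segment these helpers give 1232, 228 and 1200, matching `frag (0|1232)`, `frag (1232|228)` and `length 1200` in the trace.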

Fix: 

The TCP layer should check whether ip6po_minmtu is set on the socket
and, if so, clamp maxmtu to the IPv6 minimum MTU (IPV6_MMTU, 1280 bytes).
The code fragment below should do that but has not been tested.

sys/netinet/tcp_input.c:

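        if (isipv6) {
                struct ip6_pktopts *opt;

                maxmtu = tcp_maxmtu6(&inp->inp_inc, mtuflags);
                /*
                 * ip6po_minmtu defaults to IP6PO_MINMTU_MCASTONLY (-1),
                 * so test for IP6PO_MINMTU_ALL rather than non-zero.
                 */
                opt = inp->inp_depend6.inp6_outputopts;
                if (opt != NULL && opt->ip6po_minmtu == IP6PO_MINMTU_ALL)
                        maxmtu = min(maxmtu, IPV6_MMTU);
                tp->t_maxopd = tp->t_maxseg = V_tcp_v6mssdflt;
        } else

How-To-Repeat: 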
Apply the following patch to named and transfer a zone.
[The intent of the patch is to avoid PMTUD issues.  Too
many nameservers are behind load balancers / firewalls that
don't pass PTB messages.]

diff --git a/lib/isc/unix/socket.c b/lib/isc/unix/socket.c
index ffe7e02..6fb8860 100644
--- a/lib/isc/unix/socket.c
+++ b/lib/isc/unix/socket.c
@@ -2262,6 +2264,31 @@ clear_bsdcompat(void) {
 }
 #endif
 
+static void
+use_min_mtu(isc__socket_t *sock) {
+#if !defined(IPV6_USE_MIN_MTU) && !defined(IPV6_MTU)
+	UNUSED(sock);
+#endif
+#ifdef IPV6_USE_MIN_MTU
+	/* use minimum MTU */
+	if (sock->pf == AF_INET6) {
+		int on = 1;
+		(void)setsockopt(sock->fd, IPPROTO_IPV6, IPV6_USE_MIN_MTU,
+				(void *)&on, sizeof(on));
+	}
+#endif
+#if defined(IPV6_MTU)
+	/*
+	 * Use minimum MTU on IPv6 sockets.
+	 */
+	if (sock->pf == AF_INET6) {
+		int mtu = 1280;
+		(void)setsockopt(sock->fd, IPPROTO_IPV6, IPV6_MTU,
+				 &mtu, sizeof(mtu));
+	}
+#endif
+}
+
 static isc_result_t
 opensocket(isc__socketmgr_t *manager, isc__socket_t *sock,
 	   isc__socket_t *dup_socket)
@@ -2426,6 +2453,11 @@ opensocket(isc__socketmgr_t *manager, isc__socket_t *sock,
 	}
 #endif
 
+	/*
+	 * Use minimum mtu if possible.
+	 */
+	use_min_mtu(sock);
+
 #if defined(USE_CMSG) || defined(SO_RCVBUF)
 	if (sock->type == isc_sockettype_udp) {
 
@@ -2490,32 +2522,6 @@ opensocket(isc__socketmgr_t *manager, isc__socket_t *sock,
 		}
 #endif /* IPV6_RECVPKTINFO */
 #endif /* ISC_PLATFORM_HAVEIN6PKTINFO */
-#ifdef IPV6_USE_MIN_MTU        /* RFC 3542, not too common yet*/
-		/* use minimum MTU */
-		if (sock->pf == AF_INET6 &&
-		    setsockopt(sock->fd, IPPROTO_IPV6, IPV6_USE_MIN_MTU,
-			       (void *)&on, sizeof(on)) < 0) {
-			isc__strerror(errno, strbuf, sizeof(strbuf));
-			UNEXPECTED_ERROR(__FILE__, __LINE__,
-					 "setsockopt(%d, IPV6_USE_MIN_MTU) "
-					 "%s: %s", sock->fd,
-					 isc_msgcat_get(isc_msgcat,
-							ISC_MSGSET_GENERAL,
-							ISC_MSG_FAILED,
-							"failed"),
-					 strbuf);
-		}
-#endif
-#if defined(IPV6_MTU)
-		/*
-		 * Use minimum MTU on IPv6 sockets.
-		 */
-		if (sock->pf == AF_INET6) {
-			int mtu = 1280;
-			(void)setsockopt(sock->fd, IPPROTO_IPV6, IPV6_MTU,
-					 &mtu, sizeof(mtu));
-		}
-#endif
 #if defined(IPV6_MTU_DISCOVER) && defined(IPV6_PMTUDISC_DONT)
 		/*
 		 * Turn off Path MTU discovery on IPv6/UDP sockets.
@@ -3313,6 +3319,11 @@ internal_accept(isc_task_t *me, isc_event_t *ev) {
 		NEWCONNSOCK(dev)->connected = 1;
 
 		/*
+		 * Use minimum mtu if possible.
+		 */
+		use_min_mtu(NEWCONNSOCK(dev));
+
+		/*
 		 * Save away the remote address
 		 */
 		dev->address = NEWCONNSOCK(dev)->peer_address;
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2012-11-08 23:48:35 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-net

Over to maintainer(s).
Comment 2 Andre Oppermann freebsd_committer freebsd_triage 2012-11-08 23:58:17 UTC
Responsible Changed
From-To: freebsd-net->andre

Take over.
Comment 3 marka 2013-01-25 01:48:27 UTC
If IP6PO_MINMTU_ALL is set, it should also affect MSS negotiation.


Comment 4 marka 2013-03-20 03:51:41 UTC
The advertised MSS needs to account for the IPv6 and TCP header sizes.
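
As a rough illustration of the header accounting mentioned in this comment: the MSS option carries the MTU minus the fixed IPv6 header (40 bytes) and the fixed TCP header (20 bytes), with TCP options not subtracted. The function name below is hypothetical and this is only a sketch of the arithmetic, not the code from the attached patch.

```c
#include <assert.h>

enum {
	EX_IP6_HDR_LEN = 40, /* fixed IPv6 header */
	EX_TCP_HDR_LEN = 20  /* TCP header without options */
};

/* MSS to advertise for a given IPv6 MTU (TCP options are not subtracted). */
static int
ex_tcp6_mss(int mtu)
{
	assert(mtu >= 1280); /* IPv6 links must support at least 1280 bytes */
	return (mtu - EX_IP6_HDR_LEN - EX_TCP_HDR_LEN);
}
```

With the 1280-byte minimum MTU this yields an MSS of 1220, compared with 1440 for a 1500-byte MTU.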
Comment 5 Hiren Panchasara freebsd_committer freebsd_triage 2016-12-22 20:08:51 UTC
Mark, can you please check if this is still a problem?
Assigning back to the pool.
Comment 6 marka 2017-03-30 14:10:10 UTC
(In reply to Hiren Panchasara from comment #5)
My test system died years ago but I believe that it still is a problem.

It should be trivial to check.

Create an IPv6 TCP socket.
Set IPV6_USE_MIN_MTU=1 using setsockopt.
Connect to a data sink.
Write 1400 bytes to the socket in a single operation.

Examine the packets sent with tcpdump.  There should be no fragmented
packets being sent as TCP is supposed to take into account MTU
information.

Mark
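
The reproduction steps in this comment could be sketched roughly as follows. The helper name `ex_write_with_min_mtu` is hypothetical, the host is assumed to be a numeric IPv6 address, and the `#ifdef` guard mirrors the BIND patch above because not every platform defines IPV6_USE_MIN_MTU. It has not been run against an affected kernel.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <assert.h>
#include <netdb.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Connect to host:port over IPv6 TCP with IPV6_USE_MIN_MTU=1 and write
 * `len` zero bytes in a single write().  Returns the number of bytes
 * written, or -1 if any step fails.
 */
static ssize_t
ex_write_with_min_mtu(const char *host, const char *port, size_t len)
{
	struct addrinfo hints, *res;
	ssize_t n = -1;
	char *buf;
	int fd;

	assert(host != NULL && port != NULL);
	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_INET6;
	hints.ai_socktype = SOCK_STREAM;
	hints.ai_flags = AI_NUMERICHOST; /* numeric address only, no DNS */
	if (getaddrinfo(host, port, &hints, &res) != 0)
		return (-1);

	fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
	if (fd >= 0) {
#ifdef IPV6_USE_MIN_MTU
		int on = 1;

		/* Set the option before connect() so it is in place for the handshake. */
		(void)setsockopt(fd, IPPROTO_IPV6, IPV6_USE_MIN_MTU,
		    &on, sizeof(on));
#endif
		buf = calloc(1, len);
		if (buf != NULL &&
		    connect(fd, res->ai_addr, res->ai_addrlen) == 0)
			n = write(fd, buf, len);
		free(buf);
		close(fd);
	}
	freeaddrinfo(res);
	return (n);
}
```

Calling it with the sink's address, its port and a length of 1400 while tcpdump runs on the sending interface should show whether the segment is fragmented or sent as smaller TCP segments.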
Comment 7 Andrey V. Elsukov freebsd_committer freebsd_triage 2017-03-30 18:21:09 UTC
(In reply to marka from comment #6)
> (In reply to Hiren Panchasara from comment #5)
> My test system died years ago but I believe that it still is a problem.
> 
> It should be trivial to check.
> 
> Create an IPv6 TCP socket.
> Set IPV6_USE_MIN_MTU=1 using setsockopt.
> Connect to a data sink.
> Write 1400 bytes to the socket in a single operation.
> 
> Examine the packets sent with tcpdump.  There should be no fragmented
> packets being sent as TCP is supposed to take into account MTU
> information.

According to RFC 3542, this is what the kernel should do: perform IP fragmentation as the application requested.

https://tools.ietf.org/html/rfc3542#section-11.1

"If the packet is larger than the minimum MTU and this feature has been enabled the IP layer will fragment to the minimum MTU."
Comment 8 Andrey V. Elsukov freebsd_committer freebsd_triage 2017-03-30 18:28:42 UTC
And this is what always pisses me off. If we have a 10/50/100G link with a 9k MTU, BIND always does IPv6 fragmentation due to this option.
Comment 9 marka 2017-03-30 20:46:15 UTC
(In reply to Andrey V. Elsukov from comment #7)
RFC 6691

o  As a result, when the effective MTU of an interface varies, TCP
   SHOULD use the smallest effective MTU of the interface to calculate
   the value to advertise in the MSS option.

IPV6_USE_MIN_MTU=1 changes the effective MTU of the interface for this
socket.
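
One way to picture the "effective MTU" reading argued in this comment is the sketch below. It assumes the three option values defined in RFC 3542 section 11.1 (-1, 0 and 1); the function name is hypothetical, `path_mtu` stands for whatever path MTU discovery currently reports, and this interpretation is the one under dispute in the thread rather than settled kernel behaviour.

```c
#include <assert.h>

enum { EX_IPV6_MMTU = 1280 }; /* IPv6 minimum link MTU */

/*
 * use_min_mtu follows RFC 3542 section 11.1:
 *   -1  minimum MTU for multicast destinations only (the default)
 *    0  always use the discovered path MTU
 *    1  always use the minimum MTU
 */
static int
ex_effective_mtu(int path_mtu, int use_min_mtu, int is_multicast)
{
	assert(use_min_mtu >= -1 && use_min_mtu <= 1);
	if (use_min_mtu == 1 || (use_min_mtu == -1 && is_multicast))
		return (path_mtu < EX_IPV6_MMTU ? path_mtu : EX_IPV6_MMTU);
	return (path_mtu);
}
```

Under this reading a unicast TCP socket on a 9000-byte link has an effective MTU of 1280 once the option is set to 1, and keeps 9000 with the default of -1.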
Comment 10 marka 2017-03-30 20:46:41 UTC
(In reply to Andrey V. Elsukov from comment #8)
So what!  Most DNS/TCP responses are only a few packets.  What does it
matter if it is 3 or 4 packets?

What matters is avoiding PMTUD as it is NOT reliable.  Setting the
IPv6 packet size to 1280 avoids triggering PMTUD issues.  Limiting
the packet size avoids timeouts and retransmissions due to PTB messages
not being generated because of rate limiting, or being lost to stupid
load balancers and firewalls that drop ICMP.

Go put your validating resolvers behind an IPv6-in-IPv4 link then
come back and say this is not needed.
Comment 11 Andrey V. Elsukov freebsd_committer freebsd_triage 2017-03-31 01:02:36 UTC
(In reply to marka from comment #9)
> (In reply to Andrey V. Elsukov from comment #7)
> RFC 6691
> 
> o  As a result, when the effective MTU of an interface varies, TCP
>    SHOULD use the smallest effective MTU of the interface to calculate
>    the value to advertise in the MSS option.
> 
> IPV6_USE_MIN_MTU=1 changes the effective MTU of the interface for this
> socket.

This is a socket option; it doesn't change the interface's MTU value and, as far as I can see, doesn't affect the MSS value. It just instructs the kernel to explicitly do IPv6 fragmentation, exactly as described in RFC 3542.
Comment 12 Andrey V. Elsukov freebsd_committer freebsd_triage 2017-03-31 01:10:38 UTC
(In reply to marka from comment #10)
> (In reply to Andrey V. Elsukov from comment #8)
> So what!  Most DNS/TCP responses are only a few packets.  What does it
> matter if it is 3 or 4 packets?

Zone transfers need a lot of such packets.
 
> What matters is avoiding PMTUD as it is NOT reliable.  Setting the
> IPv6 packet size to 1280 avoids triggering PMTUD issues.  Limiting
> the packet size avoids timeouts and retransmissions due to PTB messages
> not being generated because of rate limiting, or being lost to stupid
> load balancers and firewalls that drop ICMP.
> 
> Go put your validating resolvers behind an IPv6-in-IPv4 link then
> come back and say this is not needed.

When I build the network in the DC, I know better what MTU can be used in my network. Forcing a 1280-byte packet size on a network where 9k is the default MTU is strange, to say the least, in 2017.
Comment 13 marka 2017-03-31 02:50:37 UTC
(In reply to Andrey V. Elsukov from comment #11)

Read the words "effective MTU" that I quoted.  The "effective MTU" is 1280 with this option set.
Comment 14 Michael Tuexen freebsd_committer freebsd_triage 2017-08-14 21:08:35 UTC
I think the TCP MSS should honor the IPV6_USE_MIN_MTU option, the same way as SCTP should use the minmtu. We should also disable PMTU discovery when the socket option is set.
Will put a corresponding patch into phabricator...
Comment 15 Eitan Adler freebsd_committer freebsd_triage 2018-05-23 10:27:30 UTC
batch change of PRs untouched in 2018 marked "in progress" back to open.
Comment 16 Michael Tuexen freebsd_committer freebsd_triage 2018-08-18 23:06:05 UTC
A patch is under review: https://reviews.freebsd.org/D16796
Comment 17 commit-hook freebsd_committer freebsd_triage 2018-08-21 14:12:43 UTC
A commit references this bug:

Author: tuexen
Date: Tue Aug 21 14:12:31 UTC 2018
New revision: 338138
URL: https://svnweb.freebsd.org/changeset/base/338138

Log:
  Enabling the IPPROTO_IPV6 level socket option IPV6_USE_MIN_MTU on a TCP
  socket resulted in sending fragmented IPV6 packets.

  This is fixed by reducing the MSS to the appropriate value. In addition,
  if the socket option is set before the handshake happens, announce this
  MSS to the peer. This is not strictly required, but done since TCP
  is conservative.

  PR:			173444
  Reviewed by:		bz@, rrs@
  MFC after:		1 month
  Sponsored by:		Netflix, Inc.
  Differential Revision:	https://reviews.freebsd.org/D16796

Changes:
  head/sys/netinet/in_pcb.h
  head/sys/netinet/tcp_input.c
  head/sys/netinet/tcp_subr.c
  head/sys/netinet/tcp_usrreq.c
Comment 18 Oleksandr Tymoshenko freebsd_committer freebsd_triage 2019-01-20 03:19:36 UTC
Michael, can this PR be closed?

Thanks
Comment 19 Michael Tuexen freebsd_committer freebsd_triage 2019-01-20 09:16:35 UTC
(In reply to Oleksandr Tymoshenko from comment #18)
I think I should MFC that to stable/11... Then I'll close it. Will do that next week.
Comment 20 commit-hook freebsd_committer freebsd_triage 2019-01-25 15:26:52 UTC
A commit references this bug:

Author: tuexen
Date: Fri Jan 25 15:25:54 UTC 2019
New revision: 343432
URL: https://svnweb.freebsd.org/changeset/base/343432

Log:
  MFC r338138:

  Enabling the IPPROTO_IPV6 level socket option IPV6_USE_MIN_MTU on a TCP
  socket resulted in sending fragmented IPV6 packets.

  This is fixed by reducing the MSS to the appropriate value. In addition,
  if the socket option is set before the handshake happens, announce this
  MSS to the peer. This is not strictly required, but done since TCP
  is conservative.

  PR:			173444
  Reviewed by:		bz@, rrs@
  Sponsored by:		Netflix, Inc.
  Differential Revision:	https://reviews.freebsd.org/D16796

Changes:
_U  stable/11/
  stable/11/sys/netinet/in_pcb.h
  stable/11/sys/netinet/tcp_input.c
  stable/11/sys/netinet/tcp_subr.c
  stable/11/sys/netinet/tcp_usrreq.c
Comment 21 Michael Tuexen freebsd_committer freebsd_triage 2019-01-25 15:29:09 UTC
MFCed to stable/11, but will not MFC to stable/10, since FreeBSD 10.x is EOL.