Bug 242744

Summary: IPSec in transport mode between FreeBSD hosts blackholes TCP traffic
Product: Base System Reporter: Victor Sudakov <vas>
Component: kern Assignee: freebsd-net (Nobody) <net>
Status: Open ---    
Severity: Affects Some People CC: ae, archit, bz, dewayne, eugen, franco, it.support, julian, luke.hamburg, m.muenz, p.zeddi, vas
Priority: ---    
Version: 12.1-RELEASE   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
net.inet.ipsec.trans.cleardf none
ESP from Windows server to FreeBSD none

Description Victor Sudakov 2019-12-20 18:41:02 UTC
When you configure transport mode IPSec between two FreeBSD hosts (no tunnels or if_ipsec), TCP connectivity between those hosts breaks. It happens because a) ESP packets are always generated with the DF flag set, b) PMTUD does not work in IPSec transport mode because there is no interface (?) c) when TCP segments of standard size are encapsulated into ESP packets, the resulting oversized ESP packets cannot pass through any interface with MTU=1500, nor can they be fragmented because of the DF flag, so they are just blackholed and never leave the host.
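
As a rough illustration of point c), the arithmetic can be sketched in shell. The cipher parameters are assumptions, since the report does not state them: IPv4 without options, AES-CBC (16-byte IV and block size) and HMAC-SHA1-96 (12-byte ICV). Other algorithms give different totals.

```shell
#!/bin/sh
# Assumed parameters (not stated in the report): IPv4, AES-CBC, HMAC-SHA1-96.
IP_HDR=20        # IPv4 header without options
TCP_HDR=20       # fixed TCP header; options are counted inside the MSS
ESP_HDR=8        # SPI + sequence number
IV_LEN=16        # AES-CBC initialization vector
BLOCK=16         # AES block size, the padding granularity
ESP_TRAILER=2    # pad length + next header
ICV_LEN=12       # HMAC-SHA1-96 integrity check value

# Print the size of the transport mode ESP packet that carries one
# full-sized TCP segment for the given MSS.
esp_packet_size() {
    mss=$1
    inner=$((mss + TCP_HDR + ESP_TRAILER))
    # Pad the encrypted part up to a multiple of the cipher block size.
    padded=$(( (inner + BLOCK - 1) / BLOCK * BLOCK ))
    echo $((IP_HDR + ESP_HDR + IV_LEN + padded + ICV_LEN))
}

esp_packet_size 1460    # prints 1544, which exceeds a 1500-byte MTU
```

Under these assumptions a standard 1460-byte MSS yields a 1544-byte ESP packet, which cannot leave a 1500-byte interface unfragmented.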

How to reproduce. Configure a simple transport mode IPSec between two FreeBSD hosts and try to scp files from one host to another. The file transfer will inevitably stall, until you clear all IPSec policies. Watch with tcpdump: all ESP packets have the DF flag set, but large ESP packets will be missing.

A workaround. A host route to the peer with "-mtu 1400" can be configured as described in https://lists.freebsd.org/pipermail/freebsd-net/2019-December/054952.html but it is not scalable.

What is to be done. ESP packets should not have the DF flags set by default for things to "just work."

I've checked that the net.inet.ipsec.dfbit does not affect transport mode. Regardless of its value, the DF flag is always on.
Comment 1 Victor Sudakov 2019-12-20 18:42:09 UTC
All the above seems to be valid for IPv6 too.
Comment 2 dewayne 2019-12-20 21:10:44 UTC
Victor - are you using asymmetric keys between hosts during this phase of testing, or are you using pre-shared keys?  If asymmetric, can you try with insecure keys (so the key can be transmitted within a packet)?
Comment 3 Victor Sudakov 2019-12-21 02:33:27 UTC
(In reply to dewayne from comment #2)
I don't quite understand the second part of your question, the problem is not within ISAKMP. ISAKMP works fine. The problem with TCP begins later, when all SA are already established.

But answering your question, the test lab uses preshared keys, racoon with a vanilla configuration.
Comment 4 Eugene Grosbein freebsd_committer freebsd_triage 2019-12-21 08:16:45 UTC
There are multiple ways to solve this problem that work just fine for FreeBSD 11 at least.

First, one can use IPSec transport mode combined with a gif tunnel and mtu=1500 for the gif. Oversized IPv4 gif packets have the DF bit set to 0, as per the gif(4) manual page, so they get fragmented while being transmitted over a path whose lowest intermediate MTU is 1500 or less, and no packet drops occur.

Second, one can try sysctl net.inet.ipsec.dfbit=0 that is documented in ipsec(4) manual page for IPSec tunnel mode but maybe it works for transport mode, too. Check it out. Maybe, you can switch your IPSec to tunnel mode.

Third, you can adjust TCP MSS by means of packet filters. For example, ipfw currently has an additional kernel module, ipfw_pmod.ko, and the command ipfw tcp-setmss.
Comment 5 Victor Sudakov 2019-12-21 08:33:50 UTC
(In reply to Eugene Grosbein from comment #4)

> First, one can use IPSec transport mode combined with gif tunnel and mtu=1500 for the gif. 

The solution with gif or if_ipsec tunnels is not scalable if you want to create a mesh of hosts with protected traffic between them. If we are talking about not more than 2-3 hosts, then the if_ipsec solution is the most elegant. 

> Second, one can try sysctl net.inet.ipsec.dfbit=0 that is documented in 
> ipsec(4) manual page for IPSec tunnel mode 
> but maybe it works for transport mode, too

I wrote in the initial problem description that this sysctl does not work for transport mode. You just did not pay attention.

> Third, you can adjust TCP MSS by means of packet filters. 

I don't think I can if the packet in question is not received or transmitted via any interface (like locally generated ssh-client traffic intercepted by IPSec policies). Or I'll try if you provide an example of matching such a packet.

I also tried pf's "scrub out proto 50 no-df" but there was no match.

In a FreeBSD - Windows 7 combination, this kind of transport mode works transparently out of the box. I think Windows knows to adjust MSS, or something.
Comment 6 Eugene Grosbein freebsd_committer freebsd_triage 2019-12-21 08:51:56 UTC
OTOH, RFC 2401 Appendix B https://tools.ietf.org/html/rfc2401#page-1-48 states that packets generated by IPSec transport mode must be allowed to fragment over the path and this is incompatible with current behaviour keeping DF=1 for TCP and may be an error in our IPSEC stack. Adding ae@ to CC: list.

Andrey, what is your opinion on the problem? Should we clear DF bit unconditionally for outgoing IPv4 IPSec transport mode packets?
Comment 7 Eugene Grosbein freebsd_committer freebsd_triage 2019-12-21 08:56:25 UTC
(In reply to Victor Sudakov from comment #5)

> I don't think I can if the packet in question is not received or transmitted
> via any interface (like locally generated ssh-client traffic intercepted
> by IPSec policies).

Any outgoing packet has its destination IP address and it is not changed by IPSec transport mode. It's possible to perform routing lookup for any reachable destination IP address to discover transmit MTU and deduce right MSS.
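
A minimal sketch of such a lookup: the helper below reads `route -n get` output on stdin and prints the value under the "mtu" column. The function names and the sample output are illustrative assumptions (the sample only mimics the usual FreeBSD layout); on a real host the input would come from `route -n get <peer>`.

```shell
#!/bin/sh
# Print the MTU column from `route -n get` output read on stdin.
route_mtu() {
    awk '
        col { print $col; exit }    # line after the header: print the value
        { for (i = 1; i <= NF; i++) if ($i == "mtu") col = i }    # locate the column
    '
}

# Illustrative sample only; addresses and interface name are made up.
sample_route_output() {
    cat <<'EOF'
   route to: 192.0.2.10
destination: 192.0.2.0
       mask: 255.255.255.0
        fib: 0
  interface: em0
      flags: <UP,DONE,PINNED>
 recvpipe  sendpipe  ssthresh  rtt,msec    mtu        weight    expire
       0         0         0         0      1500         1         0
EOF
}

sample_route_output | route_mtu    # prints 1500
```

On a live system the equivalent call would be `route -n get "$REMOTE_ADDR" | route_mtu`; the MSS then follows from the MTU minus the IP, ESP and TCP overhead.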
Comment 8 Eugene Grosbein freebsd_committer freebsd_triage 2019-12-21 09:04:39 UTC
(In reply to Victor Sudakov from comment #5)

>In a FreeBSD - Windows 7 combination, this kind of transport mode works 
> transparently out of the box. I think Windows knows to adjust MSS, or something.

Can you enable some TCP service on the FreeBSD side (e.g. inetd/echo or ftpd) and check whether Windows sets DF=1 for the initial encrypted TCP SYN when you connect from Windows to FreeBSD over such an IPSec transport mode configuration?
Comment 9 Victor Sudakov 2019-12-21 09:08:50 UTC
(In reply to Eugene Grosbein from comment #7)

> It's possible to perform routing lookup for any reachable destination IP address to discover transmit MTU and deduce right MSS.

Yes, this (or similar) advice was given in https://lists.freebsd.org/pipermail/freebsd-net/2019-December/054952.html It works (I checked) but does not scale.
Comment 10 Victor Sudakov 2019-12-21 09:12:54 UTC
(In reply to Eugene Grosbein from comment #8)
> check it out if Windows sets DF=1 for initial encrypted TCP SYN

My FreeBSD - Windows7 IPSec configuration is gone with my Windows7 workstation. If it helps the cause, I can recreate it with Windows 10 or Windows 2016 server, it will take some time though because I don't remember very well how you set up SPD on Windows, it was somewhat non-trivial.
Comment 11 Eugene Grosbein freebsd_committer freebsd_triage 2019-12-21 12:14:11 UTC
(In reply to Victor Sudakov from comment #9)

It does scale: with racoon, you can use phase1 up-script to create specific routes with -mtu 1400 automatically.
Comment 12 Eugene Grosbein freebsd_committer freebsd_triage 2019-12-21 12:15:28 UTC
(In reply to Victor Sudakov from comment #10)

Windows 7 should be fine. I don't think newer versions of Windows have a regression dealing with DF bit.
Comment 13 Eugene Grosbein freebsd_committer freebsd_triage 2019-12-21 13:02:13 UTC
(In reply to Victor Sudakov from comment #5)

> Or I'll try if you provide an example of matching such a packet.

This works for me:

ipfw add tcp-setmss 1418 tcp from any to 'table(1)' tcpflags syn out
ipfw add tcp-setmss 1418 tcp from 'table(1)' to any tcpflags syn in
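
One way to arrive at a value like 1418 is sketched below. This is an assumption about where the number comes from, not something stated in the comment: it takes IPv4 without options, AES-CBC (16-byte IV and block size) and HMAC-SHA1-96 (12-byte ICV), and finds the largest MSS whose ESP packet still fits the MTU. The function name max_mss is hypothetical.

```shell
#!/bin/sh
# Assumed parameters (not stated in the thread): IPv4, AES-CBC, HMAC-SHA1-96.
IP_HDR=20; TCP_HDR=20; ESP_HDR=8; IV_LEN=16; BLOCK=16; ESP_TRAILER=2; ICV_LEN=12

# Print the largest TCP MSS whose transport mode ESP packet fits in the MTU.
max_mss() {
    mtu=$1
    # Room left for the encrypted part after the outer headers and the ICV.
    room=$((mtu - IP_HDR - ESP_HDR - IV_LEN - ICV_LEN))
    # The encrypted part must be a whole number of cipher blocks.
    room=$((room / BLOCK * BLOCK))
    # Remove the ESP trailer and the fixed TCP header to get the MSS.
    echo $((room - ESP_TRAILER - TCP_HDR))
}

max_mss 1500    # prints 1418
```

With a different cipher suite or a smaller path MTU the result changes, so the value in the rules above would need to be recomputed.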
Comment 14 Victor Sudakov 2019-12-21 17:41:28 UTC
(In reply to Eugene Grosbein from comment #11)
> you can use phase1 up-script to create specific routes

A clever idea. A host route to $REMOTE_ADDR via... via what? Maybe sourcing rc.conf for $defaultrouter would be sufficient in most cases.

Your idea about ipfw. Can it really match locally created packets not passing via any interface?
Comment 15 Eugene Grosbein freebsd_committer freebsd_triage 2019-12-21 23:24:28 UTC
(In reply to Victor Sudakov from comment #14)

Routing lookup can be performed within shell script, too:

gw=$(route -n get "$REMOTE_ADDR" | awk '/gateway: / {print $2}')

As for ipfw: first, ipfw never required matching on some interface name, this is optional. Second, every outgoing locally generated packet has its outgoing interface anyway, including those targeted to the same host, which go out via lo0.
Comment 16 Eugene Grosbein freebsd_committer freebsd_triage 2019-12-21 23:56:57 UTC
Created attachment 210122 [details]
net.inet.ipsec.trans.cleardf

For testing: new sysctl net.inet.ipsec.trans.cleardf is zero by default. If set to 1, it forces clearing DF bit for outgoing encrypted transport mode packets.
Comment 17 Victor Sudakov 2019-12-22 08:58:34 UTC
(In reply to Eugene Grosbein from comment #16)
Eugene, could you make "no DF bit" the default behavior? Without the DF bit, transport mode will work "out of the box" in a vanilla configuration.

If someone somewhere needs the DF flag on ESP packets, no doubt for obscure and sinister purposes, they could reenable it for themselves.

Remember that net.inet.ipsec.dfbit=0 by default.
Comment 18 dewayne 2019-12-22 23:04:09 UTC
(In reply to Eugene Grosbein from comment #16)
I thought that there was a convention regarding sysctl naming format.  Should 
net.inet.ipsec.trans.cleardf be net.inet.ipsec.trans_cleardf, or are there plans for the trans sub-branch?

As it might help people coming into ipsec in the future, is it possible to have a crisp (clear) description that distinguishes 
net.inet.ipsec.trans.cleardf: "Clear do not fragment bit for outgoing transport mode packets."
and
net.inet.ipsec.dfbit=Do not fragment bit on encap.

Suggestion
net.inet.ipsec.dfbit="Do not fragment bit on tunnel encap."
                                             ^
  
(I'd personally prefer net.inet.ipsec.tunnel_cleardf, and obsolete, in the future,  ipsec.dfbit as it doesn't do as currently stated. Perhaps worth consideration?)
Comment 19 Eugene Grosbein freebsd_committer freebsd_triage 2019-12-23 04:15:24 UTC
(In reply to dewayne from comment #18)

Yes, the sysctl is somewhat misnamed but it's for testing only, not considered a permanent solution. I am still waiting for testing results from Victor. If we get good results and agreement with other developers, we ought to just clear DF unconditionally.
Comment 20 Victor Sudakov 2019-12-24 14:20:58 UTC
(In reply to Eugene Grosbein from comment #15)
I've made a quick and dirty script which I run from the remote block.

It seems that this workaround does work.

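#!/bin/sh
# Run by racoon: the event name arrives in $1, the peer address in $REMOTE_ADDR.

# Look up the gateway towards the peer; an address containing ":" is IPv6.
if echo "$REMOTE_ADDR" | grep -q ":" ; then
        gw=$(route -6 -n get "$REMOTE_ADDR" | awk '/gateway: / {print $2}')
else
        gw=$(route -4 -n get "$REMOTE_ADDR" | awk '/gateway: / {print $2}')
fi

case "${1}" in
phase1_up)
        # Add a host route with a reduced MTU so that ESP packets fit the link.
        route add -host "$REMOTE_ADDR" -mtu 1200 "$gw"
        ;;
*)
        # Any other event: remove the host route again.
        route delete -host "$REMOTE_ADDR"
        ;;
esac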
Comment 21 Andrey V. Elsukov freebsd_committer freebsd_triage 2019-12-24 14:31:23 UTC
I have a not yet fully thought-out idea of how to fix this problem. I'll try to implement it during the coming holidays.

There is still an unimplemented IPsec method, IPSEC_CTLINPUT, and an unused hdrsz field in struct inpcbpolicy. We can use them to handle inbound ICMP NEEDFRAG messages and adjust the required room for the TCP protocol.
Comment 22 Victor Sudakov 2019-12-24 15:28:38 UTC
(In reply to Eugene Grosbein from comment #8)
> Can you enable some TCP service at FreeBSD side (f.e. inetd/echo or ftpd)
> and check it out if Windows sets DF=1 for initial encrypted TCP SYN 
> when you connect from Windows to FreeBSD over such IPSec transport 
> mode configuration?

I've finally found time to do that. 192.168.3.80 is a Windows 2012 server, 192.168.3.1 is FreeBSD with daytime and ftpd services enabled. As you see from the packet dump, all ESP packets have the DF flag set.
Comment 23 Victor Sudakov 2019-12-24 15:29:32 UTC
Created attachment 210202 [details]
ESP from Windows server to FreeBSD
Comment 24 Victor Sudakov 2019-12-25 03:03:25 UTC
(In reply to Eugene Grosbein from comment #19)
> I still wait for testing results from Victor. 
> If we get good results and agreement with other developers, we ought just clear DF unconditionally.

I'm beginning to feel that the solution is not as simple as clearing the DF flag unconditionally. Windows does not do that, as seen from the packet dump I attached yesterday https://bugs.freebsd.org/bugzilla/attachment.cgi?id=210202 , and still FTP from a Windows host works over an IPSec transport mode.
Comment 25 Victor Sudakov 2019-12-25 12:55:36 UTC
The more I think of it, the more I feel that the idea of removing the DF flag from ESP packets is incorrect. Because in IPv6, there is no flag to remove. If an IPv6 packet was not fragmented by the originator, there is nothing to be done in transit.
Comment 26 Bjoern A. Zeeb freebsd_committer freebsd_triage 2020-01-11 11:59:21 UTC
I think anyone looking into this should look at the later RFCs not the original ones:

https://tools.ietf.org/html/rfc4301 section 4.1 and appendix D.1 might be the best starting points.  Searching the document for "transport" will turn up more.
Comment 27 Victor Sudakov 2020-01-12 07:06:17 UTC
(In reply to Bjoern A. Zeeb from comment #26)

Bjoern, can you formulate in a few words of your own what behavior you deem appropriate in accordance with the later RFCs? 

I can only say that what we have now is completely broken: you enable IPSec transport mode between FreeBSD hosts on your LAN (very easy and elegant with strongswan, as it turns out) and bummer! Your TCP does not work any more.
Comment 28 Julian Elischer freebsd_committer freebsd_triage 2020-01-12 09:52:40 UTC
A few years ago I used multiple routing tables to have a different MTU for one table, which was used for processes that were going to use tunnels or ipsec.

I can't remember the details of how I forced it, but my memory is that the tunnels went to table 1, which was 1500, and everything else went to the default table, which was 1400.
Comment 29 Victor Sudakov 2020-01-12 11:09:51 UTC
(In reply to Julian Elischer from comment #28)

> I used the multiple routing tables to have different MTU

This is one of the workarounds and we have even discussed something similar in the comments, but should not IPsec "just work" out of the box? That should be our goal.
Comment 30 Pawel Zeddi 2022-05-04 11:43:33 UTC
Any update on the issue?

I have 2 OPNSense (Freebsd) machines connected via IPSec tunnel and packets above some size are dropped from time to time.

IPSec should work "out of the box".
Comment 31 Andrey V. Elsukov freebsd_committer freebsd_triage 2022-05-04 14:59:38 UTC
(In reply to Pawel Zeddi from comment #30)

The proposed IPSEC_CTLINPUT() was committed in https://reviews.freebsd.org/D30992
But I did not track which versions contain this code. Probably, some issues related to MTU and DF-bit should be solved with this commit.
Comment 32 Pawel Zeddi 2022-05-05 08:12:17 UTC
(In reply to Andrey V. Elsukov from comment #31)

The patch is for FreeBSD 14, which means OPNSense will update to this version in a few years. Very inconvenient (to put it mildly).
Comment 33 Michael Muenz 2022-05-05 14:16:34 UTC
Sorry Franco for adding you to the loop, maybe you can have a look at the patch.
It may be worth backporting to the OPNsense kernel :)