Bug 207087

Summary: kernel: r295285 in 10.2-STABLE breaks OpenVPN functionality
Product: Base System
Reporter: g_amanakis
Component: kern
Assignee: George V. Neville-Neil <gnn>
Status: New
Severity: Affects Only Me
CC: KOT, ae, brooks, cy, garga, gnn, madpilot, mandree, melifaro, mgrooms, net, re, vangyzen
Priority: ---
Keywords: needs-qa, patch, regression
Version: 10.2-STABLE
Flags: koobs: mfc-stable10?
Hardware: amd64
OS: Any
URL: https://reviews.freebsd.org/D4042
Description Flags
Only use tryfoward() when pfilter hooks are not present
Copy the mbuf for use in icmp error messages. none

Description g_amanakis 2016-02-10 18:33:00 UTC
Created attachment 166844 [details]

r295285 in 10.2-STABLE breaks OpenVPN server functionality. Tested with OpenVPN 2.3.10 on amd64 bare-metal hardware with IPv4. Clients connected to the OpenVPN server experience slow IPv4 www traffic and connection resets. 

Clients connect via IPv4 UDP to the server, and in-kernel NAT is performed on the external interface. OpenVPN configs are attached. The kernel has IPSEC, IPSEC_NAT_T and VIMAGE enabled, and SCTP disabled.
Comment 1 g_amanakis 2016-02-10 18:33:24 UTC
Created attachment 166845 [details]
Comment 2 g_amanakis 2016-02-10 18:40:47 UTC
Also I just figured out that my Android devices which connect directly to the gateway running the OpenVPN server (they connect to the internal interface and not through OpenVPN) are not able to open regular webpages and some apps (eBay, Amazon, YouTube) stop functioning if this commit is applied.
Comment 3 mgrooms 2016-02-10 19:59:59 UTC
Recently I noticed that after upgrading two separate pairs of firewalls to 10.2-RELEASE, my ISAKMP daemons stopped negotiating SAs with peers. I just haven't gotten around to submitting a bug report yet. It only seems to happen when large UDP packets get fragmented due to large payloads (i.e. certificate info transmitted late in phase 1 negotiation). This may be unique to the bge driver or related hardware, as the ISAKMP daemon started working again on both sets of firewalls once I disabled hardware checksum offload (ifconfig bgeX -rxcsum). This work-around wasn't required until the upgrade to 10.2-RELEASE, but I can't say if it was at a specific patch level. I can say that one set of firewalls was upgraded from 9.2-RELEASE-p?? and the other set was upgraded from a patched 10.0-RELEASE, so I assume the commit that broke UDP re-assembly was committed sometime between 10.0-RELEASE and 10.2-RELEASE-p11. Sorry I can't be more specific.

BTW, this isn't an attempt to hijack your problem report. I just thought that the issue you describe (OpenVPN with UDP) may be related to mine, so I thought it would be worth mentioning. Have you tried disabling hardware checksum offload on your public-facing network device? If that improves the situation, it's quite possible that we are being bitten by the same issue.
Comment 4 g_amanakis 2016-02-10 20:08:13 UTC
(In reply to mgrooms from comment #3)
This issue concerns only 10.2-STABLE (now 10.3-BETA1) which is about to become 10.3-RELEASE. The commit has not been applied to 10.2-RELEASE, so you must be facing another issue. The OpenVPN clients connect with no problems to the server. 

It has to do with ip_tryforward(). If I comment out this function in ip_input.c the symptoms resolve. Could it be that some of the traffic entering ip_tryforward() bypasses the NAT?
Comment 5 mgrooms 2016-02-10 20:51:24 UTC
I see. The underlying cause is quite possibly unrelated then. As I said, I wasn't trying to hijack your bug report. But the symptom still sounds similar in the respect that some of your UDP traffic (your OpenVPN control traffic, for example) appears to be processed correctly, but other traffic (your OpenVPN transport traffic being tunneled) does not. That smacks of a re-assembly problem. In the latter case, you could have a large inner IP packet size due to the tunnel overhead, which would cause the outer IP packet to be fragmented. This would lead to stalls and resets from the client perspective, just as you describe in your bug report.

However, that doesn't necessarily explain your 2nd problem where non-tunneled traffic stalls. You can't NAT fragmented packets if you have a re-assembly problem, as the required UDP/TCP port values are only available in the initial packet of a fragmented chain. That usually only affects UDP packets, but it can still be a problem for TCP if the TCP MSS is large enough, as the DF (Don't Fragment) bit is typically set in the IP header.

In any case, good luck with your problem.
Comment 6 mgrooms 2016-02-10 20:55:13 UTC
Doah, sorry. I stopped and started writing that last paragraph while in the middle of something else. I was still thinking of things in terms of tunneling. Please disregard and I'll go away and be quiet now :)
Comment 7 Mark Linimon freebsd_committer freebsd_triage 2016-02-10 23:29:45 UTC
Assign to committer of 295285.
Comment 8 g_amanakis 2016-02-11 21:16:26 UTC
Kernels before this commit (e.g. r295264) with "net.inet.ip.fastforwarding=1" do not exhibit these symptoms.
Comment 9 George V. Neville-Neil freebsd_committer 2016-02-11 23:24:35 UTC
Can you try this without VIMAGE, and then possibly without IPSEC_NAT_T and tell me if the problem persists?  Also, can you share the output of netstat -s for all protocols including tcp, esp, ah ?
Comment 10 g_amanakis 2016-02-12 01:00:17 UTC
Created attachment 166885 [details]

Output of "netstat -s" attached.
In the local network the problem concerns primarily smartphones (I have an Android ecosystem) where some pages do not open at all. Commenting out the ip_tryforward() function resolves this.
Comment 11 g_amanakis 2016-02-12 01:49:24 UTC
I tried with IPSEC_NAT_T and VIMAGE disabled and it doesn't resolve it.
Comment 12 g_amanakis 2016-02-12 02:30:12 UTC
I did some thorough testing with a simplified IPFW ruleset (only in-kernel NAT enabled and allow everything on the local and WAN interfaces). Enabling "net.inet.ip.fastforwarding" in kernels before r295285 also exhibits the symptoms. Please disregard Comment #8 above.
Comment 13 g_amanakis 2016-02-12 03:03:46 UTC
Created attachment 166886 [details]

I did a tcpdump while an Android client tried to access a webpage (www.gutefrage.net) with "net.inet.ip.fastforwarding" turned on. I interrupted both dumps as soon as the client gave up trying to open the page.
Comment 14 George V. Neville-Neil freebsd_committer 2016-02-12 10:02:17 UTC
Thanks for all the updates, this does help to track some of this down.  A few more questions:

If you are not using an Android client does everything just work?

In your last test did you also turn off IPSEC and just use IPFW?  Can I see the IPFW ruleset you're using?

And, can I get a full pcap file rather than a text dump of the attempted session?
Comment 15 George V. Neville-Neil freebsd_committer 2016-02-12 10:13:46 UTC
Have you/can you test this on HEAD?
Comment 16 Brooks Davis freebsd_committer 2016-02-12 16:23:05 UTC
Remove freebsd-amd64 from cc.
Comment 17 g_amanakis 2016-02-12 16:49:45 UTC
Created attachment 166901 [details]

This is the simplified IPFW ruleset I am using. IPSEC is turned off in kernel compilation. I will use only this from now on in order to have a common basis. xxx.yyy and aaa.bbb are local networks. All the local clients are on the xxx.yyy network.

With this I am getting a mixed behaviour. For example my laptop client (Thinkpad X230 running Archlinux) exhibits the symptoms on some sites (most notably www.gutefrage.net) when the gateway runs the r295545 kernel (commenting out ip_tryforward() resolves it). However when the gateway runs the r295264 kernel with net.inet.ip.fastforwarding=1 the archlinux client doesn't exhibit the symptoms anymore. 

I will test this on HEAD. Is there any special tcpdump command you'd like me to run? I will try and get simultaneous dumps from the interfaces involved.
Comment 18 George V. Neville-Neil freebsd_committer 2016-02-12 16:52:25 UTC
With tcpdump just use -w /tmp/capture.pcap so you get a file rather than text based output.
Comment 19 g_amanakis 2016-02-12 21:02:09 UTC
Created attachment 166908 [details]
Comment 20 g_amanakis 2016-02-12 21:06:20 UTC
Created attachment 166909 [details]

tun0.pcap and wan.pcap (gateway interfaces) were captured simultaneously. A client (archlinux) was connected over OpenVPN to the gateway running r295545. The simplified IPFW ruleset was used and www.gutefrage.net was accessed. The webpage did not load at all.

After the capture, I also tried lowering the MTU of the tun interface on the client from 1500 to 1212 and 1196 but this didn't resolve it.
Comment 21 g_amanakis 2016-02-13 16:52:21 UTC
The problem persists on HEAD (build 20160127).
Comment 22 g_amanakis 2016-02-14 22:58:49 UTC
Created attachment 167003 [details]
Comment 23 g_amanakis 2016-02-14 23:01:29 UTC
Created attachment 167004 [details]

I did another dump on a client on the local network (directly connected to gateway, no OpenVPN involved). The gateway ran 10.2-STABLE r295264 GENERIC. The symptoms when fastforwarding was enabled were the same as with r295285.

I did 2 dumps on the client: 
net.inet.ip.fastforwarding=0 on the gateway ===> ffoff.pcapng ===> HTTP/GET happens at packet 10
net.inet.ip.fastforwarding=1 on the gateway ===> ffon.pcapng ===> HTTP/GET happens at packet 36

The only significant difference I see is that when fastforwarding is turned off the gateway sends an ICMP Fragmentation needed to the client whereas when fastforwarding is on this doesn't happen, and the client keeps retransmitting the HTTP/GET packet. Could it be that the ip_fastfwd.c doesn't correctly send ICMP when the destination is unreachable and fragmentation is required?
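
The behaviour the client depends on here is path MTU discovery. As a minimal sketch (not the kernel code; the Packet type, the function name and the addresses are made up for illustration), a router's decision for a packet that may exceed the next-hop MTU could be modelled like this:

```python
from dataclasses import dataclass


@dataclass
class Packet:
    src: str             # sender address (hypothetical example values below)
    dst: str             # destination address
    length: int          # total IP length in bytes
    dont_fragment: bool  # state of the DF bit in the IP header


def forward_decision(pkt: Packet, next_hop_mtu: int) -> str:
    """Describe what a router is expected to do with pkt."""
    if pkt.length <= next_hop_mtu:
        return "forward"
    if pkt.dont_fragment:
        # ICMP destination unreachable, fragmentation needed (type 3, code 4),
        # addressed to the original sender so it can lower its packet size.
        return f"icmp_frag_needed to {pkt.src} (mtu {next_hop_mtu})"
    return "fragment"


# A full-size TCP segment with DF set meeting a 576-byte hop.
print(forward_decision(Packet("192.168.1.10", "203.0.113.5", 1500, True), 576))
```

If the ICMP reply in the second branch never reaches the sender, the sender keeps retransmitting the same oversized packet, which matches what the ffon.pcapng capture shows.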
Comment 24 George V. Neville-Neil freebsd_committer 2016-02-14 23:23:19 UTC
Thanks for the update and the new files.  I am trying to reproduce this on HEAD still.  With your latest test were you still using IPFW and NAT or was this just vanilla forwarding?  I have set up some test hosts in the lab.

The setup is:
source   <->          router         <-> sink 
        1500                         576

and I'm doing a ping -s 1024 -D and I do see the MTU error returning on the source:

36 bytes from frag needed and DF set (MTU 576)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 041c 0000   0 0000  3f  01 debc

Your hardware addresses in the pcaps are obfuscated so it's hard to tell what's happening at layer 2.
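
For readers checking the ping output above, the header dump can be decoded by hand. The following is a small sketch, assuming the fields are printed in hexadecimal in the column order shown; the function name is made up for illustration:

```python
def decode_ping_header(fields: str) -> dict:
    """Parse the whitespace-separated hex fields of the quoted IP header dump."""
    vr, hl, tos, length, ident, flg, off, ttl, pro, cks = fields.split()
    return {
        "version": int(vr, 16),
        "header_bytes": int(hl, 16) * 4,  # HL counts 32-bit words
        "total_length": int(length, 16),
        "ttl": int(ttl, 16),
        "protocol": int(pro, 16),         # 1 is ICMP
    }


hdr = decode_ping_header("4 5 00 041c 0000 0 0000 3f 01 debc")
# Subtract the IP header and the 8-byte ICMP echo header to get the payload.
payload = hdr["total_length"] - hdr["header_bytes"] - 8
print(hdr, payload)
```

The total length of 0x041c is 1052 bytes, which is the 1024-byte payload from "ping -s 1024" plus 8 bytes of ICMP header and 20 bytes of IP header, so the quoted header is the original echo request being reported back as too large for the 576-byte link.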
Comment 25 George V. Neville-Neil freebsd_committer 2016-02-15 00:16:21 UTC
(In reply to g_amanakis from comment #20)

Jumping back a bit. I definitely see data to your client on both interfaces in the tun and em0 traces.  Looks like the client is
Comment 26 g_amanakis 2016-02-15 01:28:11 UTC
(In reply to George V. Neville-Neil from comment #25)
Yes, correct. is the WAN-IP of the gateway. I used tcprewrite to spoof the mac addresses.
Comment 27 George V. Neville-Neil freebsd_committer 2016-02-15 02:03:30 UTC
You only see this with IPFW + NAT, right?  If you just use tryforward or, on older versions, fastforward, things are fine?
Comment 28 g_amanakis 2016-02-15 03:26:59 UTC
Correct, up until now the problem occurred with IPFW and in-kernel NAT for IPv4. I will test using plain fastforwarding (without NAT on IPv4) and report.
Comment 29 George V. Neville-Neil freebsd_committer 2016-02-15 13:35:01 UTC
If you have a natd.conf file that would also be helpful.
Comment 30 g_amanakis 2016-02-15 13:57:08 UTC
I am using in-kernel NAT, you can see the configuration in the ipfw.txt attached above.
Comment 31 George V. Neville-Neil freebsd_committer 2016-02-15 20:47:53 UTC
Looking at the pcap files I see that the client is always advertising an MSS of 1460.  In your setup what are the MTUs of each interface involved?
Comment 32 g_amanakis 2016-02-16 01:07:07 UTC
(In reply to George V. Neville-Neil from comment #31)
MTU is 1500 on all interfaces (on WAN and LAN interface on the gateway, as well as on the client).
Comment 33 George V. Neville-Neil freebsd_committer 2016-02-16 01:15:32 UTC
Really?  Then why is there the "packet too large" ICMP message?
Comment 34 g_amanakis 2016-02-16 02:16:19 UTC
The only hypothesis I have is that when fragmentation is needed for an outgoing packet (I have no idea why) and the client sending this packet is behind NAT, the gateway cannot see the real IP of the client in order to send him the ICMP-fragmentation-required because the icmp_error() occurs after the outgoing packet has gone through the pfil hooks (and ipfw).

Can someone watching this report reproduce the symptoms using IPFW+NAT?
Comment 35 g_amanakis 2016-02-16 03:35:18 UTC
I just did a:
$ route get 
and got:
   route to: google-public-dns-a.google.com
destination: default
       mask: default
        fib: 0
  interface: em0
 recvpipe  sendpipe  ssthresh  rtt,msec    mtu        weight    expire
       0         0         0         0       576         1         0 

# netstat -i
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts 
em0    1500 <Link#1>      00:aa:bb:cc:dd:ee   136920     0     0   103864 
em0       - fe80::225:90f fe80::225:90ff:fe      190     -     -      107     
em0       - 2001:558:6020 2001:558:6020:167      108     -     -       96     
em0       - c-69-251-143-153.     4555     -     -     4982     

em0 is the WAN-interface. Why is there this discrepancy? 576 versus 1500?
Comment 36 g_amanakis 2016-02-16 03:39:23 UTC
I figured it out:
the dhcpcd changed the MTU of em0 each time it acquired a lease.
Commenting out the MTU option in dhcpcd.conf ("#option interface_mtu") leaves the MTU at 1500.
I think this resolves the whole thing.
I am going to test it right now and report back.
Comment 37 g_amanakis 2016-02-16 03:59:01 UTC
Setting dhcpcd to ignore the interface MTU resolves my problem. 

However if I manually reduce the MTU the problem reappears and the client receives no fragmentation-needed-ICMP. I am leaving this to the discretion of George.
Comment 38 g_amanakis 2016-02-16 14:42:50 UTC
I think the problem lies here:

if (ip_off & IP_DF) {
	icmp_error(m, ICMP_UNREACH, ICMP_UNREACH_NEEDFRAG,
	    0, mtu);
	goto consumed;
} else {

By the time the icmp_error() happens, m has gone through the firewall (see "Step 5:" in ip_fastfwd.c), meaning that outgoing NAT has already happened and that the source address of the packet has already been changed to reflect the one of the gateway. Thus when the icmp_error() takes place the ICMP is not sent to the client.

Is this correct?
Comment 39 George V. Neville-Neil freebsd_committer 2016-02-16 14:51:16 UTC
(In reply to g_amanakis from comment #38)

That does look suspicious.  In the ip_forward() routine we make a copy of the mbuf first.  I will look at a patch that synchronizes the way these work.
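
The hypothesis from comment 38 and the copy-first approach can be contrasted in a toy model. This is a sketch only: the Packet type, the function names and the addresses are hypothetical, and the real code operates on mbufs in C rather than on Python objects.

```python
from dataclasses import dataclass, replace


@dataclass
class Packet:
    src: str     # source address as currently written in the header
    dst: str
    length: int  # total IP length in bytes


def nat_outbound(pkt: Packet, gateway_ip: str) -> Packet:
    """Model outgoing NAT: the source address is rewritten in place."""
    pkt.src = gateway_ip
    return pkt


def forward_in_place(pkt: Packet, gateway_ip: str, mtu: int):
    """Error is built from the packet after the firewall hooks have run."""
    pkt = nat_outbound(pkt, gateway_ip)
    if pkt.length > mtu:
        return pkt.src  # ICMP target: already the gateway's own address
    return None


def forward_with_copy(pkt: Packet, gateway_ip: str, mtu: int):
    """Error is built from a copy taken before the firewall hooks ran."""
    original = replace(pkt)  # shallow copy of the untouched header fields
    pkt = nat_outbound(pkt, gateway_ip)
    if pkt.length > mtu:
        return original.src  # ICMP target: the real client
    return None


print(forward_in_place(Packet("192.168.1.10", "203.0.113.5", 1500), "198.51.100.1", 576))
print(forward_with_copy(Packet("192.168.1.10", "203.0.113.5", 1500), "198.51.100.1", 576))
```

In the first variant the error is addressed to the gateway itself, so the client never learns the smaller MTU; in the second it goes back to the client.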

I'd like to ask about your various MTUs.  Are they mismatched across any of the links?  I ask because I am trying to get the code to misbehave here and I have had a hard time getting that to happen.  In a simple, 3 host, test I'm trying this:

source -> router -> sink

MTU   1500      576

When you say "em0 is set to 576", where, in your setup, does that exist?
Comment 40 g_amanakis 2016-02-16 15:16:55 UTC
client -> LAN-router-WAN -> webserver (eg. gutefrage.net)
1500      1500       576    1500?

client MTU:
interface: 1500
route: 1500

router MTU:
LAN-interface: 1500
LAN-route: 1500
WAN-interface (em0): 1500
WAN-route: 576 (set by dhcpcd when run on WAN-interface)

webserver MTU: probably 1500. I don't know this for sure.

Does this help?
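
The arithmetic behind this layout can be written out as a short sketch. It assumes that the forwarding path uses the smaller of the route MTU and the interface MTU when a route MTU is set, and that IP and TCP headers are 20 bytes each with no options; the function names are made up for illustration.

```python
IP_HEADER = 20   # bytes, assuming no IP options
TCP_HEADER = 20  # bytes, assuming no TCP options


def effective_mtu(interface_mtu: int, route_mtu: int = 0) -> int:
    """A route MTU of 0 means 'not set'; otherwise the smaller value wins."""
    return min(route_mtu, interface_mtu) if route_mtu else interface_mtu


def full_segment_size(mss: int) -> int:
    """IP packet size of a TCP segment that fills the advertised MSS."""
    return mss + IP_HEADER + TCP_HEADER


wan_mtu = effective_mtu(interface_mtu=1500, route_mtu=576)
segment = full_segment_size(1460)  # MSS advertised by the client
print(wan_mtu, segment, segment > wan_mtu)
```

With the values reported in this thread, a full-size segment of 1500 bytes does not fit the 576-byte WAN route, so every such packet with DF set depends on the fragmentation-needed ICMP reaching the client.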
Comment 41 George V. Neville-Neil freebsd_committer 2016-02-16 15:21:17 UTC
(In reply to g_amanakis from comment #40)

Yes, it does.

Also, without IPFW and NAT, that is if you can make this a regular routing setup, do you see the problem?  My theory is that you will not, and that it requires the packet to go through IPFW to show the issue.
Comment 42 g_amanakis 2016-02-16 17:43:49 UTC
You are correct, I can confirm this.

On this setup without NAT involved (ipfw was set to pass all):

client --> LAN-router-LAN --> server
1500       1500       576     1500

I can see the client getting an ICMP-fragmentation-required from the router when it tries to access the server on the other side. Thus, the client can access the server.
Comment 43 Guido Falsi freebsd_committer 2016-02-16 18:13:27 UTC
(In reply to g_amanakis from comment #34)


My home router is a nanobsd image I just updated to 10.3:

10.3-BETA2 FreeBSD 10.3-BETA2 #0 r295652: Tue Feb 16 10:09:07 CET 2016

It's running openvpn, ipfw and nat, I connected with my laptop (running head) via openvpn and had no problems. I just ran a few basic things: ssh, http, transferred a few files with those protocols and had no problems.

I'm not sure about the MTUs, both connections are residential ADSL, so I guess both use 1492 on the WAN level, 1500 in the LAN.

One more difference is that the OpenVPN package was compiled in a poudriere 10.2 jail, not on the machine itself and not in 10.3, but this should not make a difference imho.

Not sure if this helps in some way, I can't make too many tests, but if something specific is needed I can get to do it.
Comment 44 George V. Neville-Neil freebsd_committer 2016-02-17 16:42:32 UTC
Created attachment 167113 [details]
Only use tryfoward() when pfilter hooks are not present

This is a patch against HEAD that I'm testing.  It ought to also apply against 10-STABLE though with an offset.  It bypasses tryforward() when there are pfil hooks present which will prevent issues from rewritten packets not having error reports generated.
Comment 45 g_amanakis 2016-02-17 23:11:02 UTC
The patch resolves the OpenVPN bug. (tested with the above ipfw.txt ruleset and OpenVPN config files).

I will report in a couple of hours if it also resolves the bug in a direct LAN connection.
Comment 46 g_amanakis 2016-02-18 01:12:20 UTC
This also resolves the bug in a direct LAN connection.
Comment 47 Andrey V. Elsukov freebsd_committer 2016-02-18 04:48:16 UTC
(In reply to George V. Neville-Neil from comment #44)
> Created attachment 167113 [details]
> Only use tryfoward() when pfilter hooks are not present
> This is a patch against HEAD that I'm testing.  It ought to also apply
> against 10-STABLE though with an offset.  It bypasses tryforward() when
> there are pfil hooks present which will prevent issues from rewritten
> packets not having error reports generated.

With this patch we will lose tryforward's goal (fastforwarding by default) for routers where a firewall is present. I guess most routers have a firewall.
Comment 48 George V. Neville-Neil freebsd_committer 2016-02-18 13:57:12 UTC
(In reply to Andrey V. Elsukov from comment #47)

It turns out that for this bug fastforward (the predecessor to tryforward) would never have worked either.  I am working up an alternate fix and testing it now, but the issue is now time.  This bug is holding up the 10.3 release.
Comment 49 Andrey V. Elsukov freebsd_committer 2016-02-18 14:06:33 UTC
(In reply to George V. Neville-Neil from comment #48)
> It turns out that for this bug fastforward (the predecessor to tryforward)
> would never have worked either.  I am working up an alternate fix and
> testing it now, but the issue is now time.  This bug is holding up the 10.3
> release.

But for those for whom fastforwarding worked (i.e. IPSEC is disabled and ipfw is enabled), now it will never work. I think it is easier and better to revert this MFC for 10.3 and properly fix it in head/.
Comment 50 George V. Neville-Neil freebsd_committer 2016-02-18 14:24:44 UTC
Created attachment 167150 [details]
Copy the mbuf for use in icmp error messages.
Comment 51 Eric van Gyzen freebsd_committer 2016-02-18 16:57:12 UTC
Comment on attachment 167150 [details]
Copy the mbuf for use in icmp error messages.

In the "Copy the mbuf" patch, some paths seem to either double-free or leak an mbuf.  I can comment on specific lines, if you'd like.
Comment 52 George V. Neville-Neil freebsd_committer 2016-02-18 18:58:50 UTC
I am now tracking an updated patch in Phabricator:


That's where the rest of this will be carried out.
Comment 53 Matthias Andree freebsd_committer 2016-11-09 22:08:21 UTC
Comment 54 Guido Falsi freebsd_committer 2016-11-10 10:56:33 UTC
(In reply to Matthias Andree from comment #53)
> ping?

I still have the same router based on NanoBSD; in the meantime I updated the image to 11.0-RELEASE. As before everything is working fine for me, and I'm not seeing this problem.

But most probably my configuration differs from the reporter's one.