Bug 203630 - [Hyper-V] [nat] [tcp] 10.2 NAT bug in TCP stack or hyperv netsvc driver
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Wei Hu
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2015-10-08 03:01 UTC by Ken Camann
Modified: 2016-07-08 15:09 UTC
CC List: 9 users

See Also:
decui: maintainer-feedback-
koobs: mfc-stable10+
koobs: mfc-stable9?


Attachments
Revert TSO and checksum offloading patch r285236 in Netvsc driver (65.88 KB, patch)
2015-10-14 04:37 UTC, Wei Hu
Only disable checksum offloading on 10.2 (2.03 KB, patch)
2015-10-16 08:04 UTC, Wei Hu
Fix a checksum offloading bug in Hyper-V netvsc driver. (503 bytes, patch)
2015-11-04 10:42 UTC, Wei Hu

Description Ken Camann 2015-10-08 03:01:43 UTC
I have encountered a bug in FreeBSD 10.2 (and also -CURRENT) when using NAT with either pf or ipfw. My setup for the gateway host is:

* Microsoft Client Hyper-V on Windows 10 host machine
* FreeBSD 10.2 Release (no upgrades or updates)
* Two network interfaces, hn0 (the LAN "private switch") and hn1 (the gateway "external switch")
* A simple pf.conf:
nat on hn1 inet from hn0:network to any -> (hn1)
pass all

I tried the equivalent for ipfw, i.e., setting firewall_type to "open" and the NAT interface to hn1.

Both configurations work fine in FreeBSD 10.1 Release, using the exact same Hyper-V setup. On FreeBSD 10.2 (and -CURRENT), connections to the Internet from the gateway itself work, but traffic from other VMs on the LAN that is forwarded through the gateway with NAT does not. I have done some basic investigation, including disabling the checksum and TSO offloading options (via ifconfig) that were added to the netvsc driver for 10.2 (in r285236), but that didn't help.

Whatever it is, it is in a common code path shared by pf and ipfw, or perhaps in the netvsc driver. In looking around the Internet, I saw a few unanswered posts (which predate 10.2) about pf mysteriously dropping state and TCP connections entering the SYN_SENT:CLOSED state immediately. That is the symptom I see in 10.2. The outbound NAT translation is successful, and tcpdump shows the packets being sent out of the external interface. But then nothing else happens (no response from the server seems to come back), and the state is dropped. This problem is easy for me to reproduce; it happens on any new Hyper-V VM I create with 10.2 Release, and likewise it always works fine with 10.1 Release.
Comment 1 Oleksandr Kryvulia 2015-10-08 12:28:46 UTC
Try to add to loader.conf and reboot
hw.vtnet.csum_disable=1

It helps me with similar configuration (10.2) under Windows Server 2012 R2
Comment 2 Ken Camann 2015-10-08 22:59:27 UTC
Hi Alexandr,

Those are tunables for the vtnet driver, the virtio-based virtual network driver. Hyper-V has its own driver (netvsc) and doesn't use vtnet. In some VM programs like VirtualBox, you can choose virtio if you want it, but I don't see that option in Client Hyper-V. Are you using Hyper-V with vtnet somehow? I have never used Windows Server Hyper-V, but I know it offers different options than the Hyper-V that comes with Windows Professional Edition, some of which relate to networking. I tried adding the tunable anyway, but it didn't do anything. Do you have a vtnet0 interface?
Comment 3 Oleksandr Kryvulia 2015-10-09 08:57:34 UTC
Sorry for my mistake. You are right: the VM I meant is running on KVM. I have some VMs running on Hyper-V, but without NAT. I can test it later.
Comment 4 Wei Hu 2015-10-12 03:01:44 UTC
Somebody else also reported a similar issue on 10.2. Unfortunately, I cannot find a way to reproduce it in house. Can you provide detailed steps for me to reproduce it, such as the pf.conf file and the NAT configuration?

Also, are you using a VLAN? Thanks.
Comment 5 Eddy 2015-10-13 00:04:19 UTC
I encounter the same problem in a slightly different configuration:

- Hyper-V 2012, hosting:
-- FreeBSD 10.2 x64 with MPD acting as a PPTP server
-- Windows 7
- Clients on the LAN (misc OS), including an old FreeBSD 6.1
- Clients connected to FreeBSD 10.2 VM by PPTP (MPD5) using misc OS as well

Firewalls turned off.

* TCP, UDP, ICMP work:
 - between all PPTP clients
 - between a PPTP client, FreeBSD and the Windows 7 virtual machine
 - between FreeBSD and local machines on the LAN

* TCP doesn't work between a PPTP client and machines on the LAN.

I tried to investigate by opening a TCP connection on port 80 from a PPTP client to the old FreeBSD 6.1 on the LAN. Here is part of the tcpdump output I captured on that old FreeBSD 6.1 machine:

IP (tos 0x0, ttl 127, id 20111, offset 0, flags [DF], proto: TCP (6), length: 48) pc-ed.lan.domain.fr.56026 > srv-mandy.lan.domain.fr.http: S, cksum 0x84d6 (incorrect (-> 0x7c54), 2429810306:2429810306(0) win 8192 <mss 1354,nop,nop,sackOK>

I have read that having an incorrect checksum was normal, so I guess the problem doesn't come from that.
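
For context on what tcpdump is checking in the "cksum ... (incorrect ...)" field: the TCP checksum is the 16-bit one's complement Internet checksum described in RFC 1071, taken over a pseudo-header plus the TCP segment. The following is only a minimal C sketch of that sum. The function names are illustrative and are not taken from the FreeBSD sources, and building the pseudo-header is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Internet checksum (RFC 1071): add the buffer as big-endian 16-bit
 * words with end-around carry, then return the one's complement.
 * An odd trailing byte is treated as the high byte of a final word.
 */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    assert(data != NULL || len == 0);

    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        sum = (sum & 0xffffu) + (sum >> 16);   /* fold the carry back in */
        data += 2;
        len -= 2;
    }
    if (len == 1) {
        sum += (uint32_t)data[0] << 8;
    }
    while (sum >> 16) {
        sum = (sum & 0xffffu) + (sum >> 16);
    }
    return (uint16_t)~sum;
}

/*
 * A buffer that already contains its own checksum field is valid when
 * the checksum computed over the whole buffer comes out as zero.
 */
static inline int inet_checksum_ok(const uint8_t *data, size_t len)
{
    return inet_checksum(data, len) == 0;
}
```

A receiver (or tcpdump) effectively performs the second check: if the sender left the field unfilled because it expected the NIC to compute it, and nothing downstream filled it in, the check fails and the segment is dropped.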

From what I saw, the problem occurs as long as the FreeBSD 10.2 is used as a gateway (NAT or not).

Let me know if I can do other tests to help you investigate the problem. Any help would be appreciated.
Comment 6 Wei Hu 2015-10-14 04:37:40 UTC
Created attachment 162011 [details]
Revert TSO and checksum offloading patch r285236 in Netvsc driver
Comment 7 Wei Hu 2015-10-14 04:38:56 UTC
If you have the test environment and can try something, can you apply the attached patch on the 10.2 server and see if the problem still occurs? The patch is a revert of r285236, which I suspect may be the culprit. But I don't have environment to reproduce.
Comment 8 Eddy 2015-10-14 16:49:41 UTC
Everything seems to work with the patch.

This is what I did:

- Create a clean new VM with FreeBSD 10.2 on the Hyper-V server.
- Activated IP forwarding: sysctl net.inet.ip.forwarding=1
- On another computer (same LAN, running Windows 10): set the default gateway to the new FreeBSD test VM. Ping/tracert to the internet work. TCP doesn't work.

- Patch netvsc with the r285236 file you provided, in the /usr/src/sys/dev/hyperv/netvsc/ folder (patch -i r285236)
- Rebuild and install the kernel, then reboot.
- TCP works from the LAN machines.

Thanks Wei!

Please note that I couldn't test it in a PPTP or NAT configuration.

Now I'll wait for the patch to be included in the next FreeBSD update (since I usually don't build custom kernels).
Comment 9 Wei Hu 2015-10-16 08:03:15 UTC
(In reply to Eddy from comment #8)

> This is what I did:
>
> - Create a clean new VM with FreeBSD 10.2 on the Hyper-V server.
> - Activated IP forwarding: sysctl net.inet.ip.forwarding=1
> - On another computer (same LAN, running Windows 10): set the default gateway to the new FreeBSD test VM. Ping/tracert to the internet work. TCP doesn't work.

In the above setup, how can pinging from the Windows 10 machine to the Internet work? A machine on the Internet has no route to send the reply back to the Windows 10 client, which is inside the LAN.

Are you using NAT on the FreeBSD 10.2 server? When I enabled NAT, everything seemed to work on 10.2 as a gateway.

So overall, I think r285236 is the cause of the problem. However, since I still cannot reproduce it and r285236 is a big change, I cannot narrow it down to a smaller part with certainty.

We have identified a suspect code path. Attached is another patch that you can test for us. Please apply it directly to clean 10.2 code (not on top of the patch I attached earlier). This new patch only disables checksum offloading. See if it helps solve the issue you are seeing.
Comment 10 Wei Hu 2015-10-16 08:04:39 UTC
Created attachment 162111 [details]
Only disable checksum offloading on 10.2
Comment 11 Eddy 2015-10-16 13:21:15 UTC
I have a separate NAT router between the VM and the Internet, but not on the FreeBSD 10.2 server:

PC-LAN-WIN10 <------> FREEBSD 10.2 VM  <------> NAT_ROUTER  <------> INTERNET

I added the NAT router as a default route on the FreeBSD test VM before doing the tests:

# route add default 192.168.1.254

I just tried to build a new kernel with the last "disable_csum_20151016.patch" you provided but I am stuck with an error:

/usr/src/sys/dev/hyperv/netvsc/hv_rndis_filter.c:828:11 error: unused variable `dev` [-Werror,-Wunused-variable]
	device_t dev = device->device;
		 ^
Comment 12 Wei Hu 2015-10-19 04:02:15 UTC
(In reply to Eddy from comment #11)

>I just tried to build a new kernel with the last "disable_csum_20151016.patch" you provided but I am stuck with an error:

>/usr/src/sys/dev/hyperv/netvsc/hv_rndis_filter.c:828:11 error: unused variable `dev` [-Werror,-Wunused-variable]
	device_t dev = device->device;
		 ^

You can just comment out this line, since the variable is not used after applying the patch. Let me know how it goes.
Comment 13 Eddy 2015-10-19 09:18:57 UTC
(In reply to Wei Hu from comment #12)

After some tests, "disable_csum_20151016.patch" doesn't solve the issue for me. The last r285236 patch worked.

Do I have to first apply the r285236 patch and then the disable_csum_20151016.patch? I only applied the new patch on clean sources.
Comment 14 Wei Hu 2015-11-04 10:42:43 UTC
Created attachment 162763 [details]
Fix a checksum offloading bug in Hyper-V netvsc driver.

Do not calculate TCP checksum when the receiving bits in csum_flags are set.
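
The idea can be illustrated with a small C sketch. It follows the wording of the later commit log in this thread ("Ignore the inbound checksum flags when doing packet forwarding") and is not the actual patch. The flag values and function names below are hypothetical and do not match FreeBSD's mbuf.h. The "buggy" variant is an assumption about how such a bug can arise, not a quote of the original driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits only; NOT the real values from FreeBSD's mbuf.h. */
#define EX_TX_CSUM_IP    0x0001u  /* stack asks the NIC to fill the IPv4 header checksum */
#define EX_TX_CSUM_TCP   0x0002u  /* stack asks the NIC to fill the TCP checksum */
#define EX_TX_CSUM_UDP   0x0004u  /* stack asks the NIC to fill the UDP checksum */
#define EX_RX_IP_VALID   0x0100u  /* set on receive: IP header checksum was verified */
#define EX_RX_DATA_VALID 0x0200u  /* set on receive: TCP/UDP checksum was verified */

#define EX_TX_MASK (EX_TX_CSUM_IP | EX_TX_CSUM_TCP | EX_TX_CSUM_UDP)
#define EX_RX_MASK (EX_RX_IP_VALID | EX_RX_DATA_VALID)

/* The two groups must not share bits, or masking could not separate them. */
static_assert((EX_TX_MASK & EX_RX_MASK) == 0, "tx and rx flag bits overlap");

/* Keep only the bits that are genuine transmit-side offload requests. */
static inline uint32_t ex_tx_offload_request(uint32_t csum_flags)
{
    return csum_flags & EX_TX_MASK;
}

/*
 * Hypothetical buggy check: any non-zero flag word is taken as an
 * offload request, so a forwarded packet that still carries
 * receive-side bits is handled as if the stack had asked for offload.
 */
static inline bool ex_buggy_wants_offload(uint32_t csum_flags)
{
    return csum_flags != 0;
}

/* Fixed check: receive-side bits are ignored on the transmit path. */
static inline bool ex_fixed_wants_offload(uint32_t csum_flags)
{
    return ex_tx_offload_request(csum_flags) != 0;
}
```

With these definitions, a forwarded packet whose flag word holds only EX_RX_IP_VALID | EX_RX_DATA_VALID triggers offload in the buggy check but not in the fixed one, while a locally generated packet with EX_TX_CSUM_TCP set is offloaded in both.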
Comment 15 Wei Hu 2015-11-04 10:44:50 UTC
Sorry for the late response.  We still cannot reproduce the issue, but another customer reported the same issue and found a bug in the Hyper-V checksum path. Attached is a patch to fix this issue. Please apply on a clean 10.2 kernel, rebuild and see if this fixes the problem you are seeing. Let us know if this works or not.
Comment 16 Eddy 2015-11-05 16:25:42 UTC
Hi and thank you for the patch.

I just applied it on a clean 10.2 kernel and made the same tests as before.

It seems to solve the issue!

I'll wait for it to be included in the next official update.
Comment 17 Wei Hu 2015-11-22 05:29:33 UTC
The fix went into Head as r291156. I will merge to 10 stable branch in a week.
Comment 18 Eddy 2015-12-07 15:20:59 UTC
Hi Wei,

Is the patch merged to stable branch?

Thank you.
Comment 19 Wei Hu 2015-12-08 01:27:29 UTC
(In reply to Eddy from comment #18)

Not yet. I will try to do it this week.
Comment 20 Kubilay Kocak freebsd_committer freebsd_triage 2015-12-08 01:30:04 UTC
(In reply to Eddy from comment #18)

@Eddy / Wei

When a commit log contains a line of the form "PR: <issueid>[, <issueid>]", the commit will be referenced as a comment against those issue IDs.

Also, committers will set the mfc-stable* flag to + when done/committed, or to - with a comment as to why an MFC to that branch is not necessary/invalid.
Comment 21 Dexuan Cui 2015-12-15 10:54:57 UTC
I think Wei has merged the fix (r285785 in Head) to both stable/10 and releng/10.2:

https://svnweb.freebsd.org/base?view=revision&revision=285928
https://svnweb.freebsd.org/base?view=revision&revision=286058


I think we can close the bug.
Comment 22 Kubilay Kocak freebsd_committer freebsd_triage 2015-12-15 11:16:32 UTC
Assign to committer that resolved.

@Wei, if this is not relevant to stable/9, please set mfc-stable9 to - with a comment
Comment 23 Eddy 2015-12-18 09:06:52 UTC
@Dexuan Cui, I'm not sure the patch is merged. The two revisions you mentioned were made on the 28th and 30th of July 2015, whereas the working patch was provided by Wei on the 4th of November.
Comment 24 Dexuan Cui 2015-12-18 09:54:28 UTC
(In reply to Eddy from comment #23)
Hi Eddy, hmm, I am sorry -- I was looking at the wrong patch...

I'll ask the committers to help to merge the correct patch to stable/10.
Comment 25 commit-hook freebsd_committer freebsd_triage 2015-12-18 14:56:58 UTC
A commit references this bug:

Author: royger
Date: Fri Dec 18 14:56:49 UTC 2015
New revision: 292439
URL: https://svnweb.freebsd.org/changeset/base/292439

Log:
  MFC r291156:

  Ignore the inbound checksum flags when doing packet forwarding in netvsc
  driver.

  Sponsored by:	Microsoft OSTC
  PR:		203630

Changes:
_U  stable/10/
  stable/10/sys/dev/hyperv/netvsc/hv_netvsc_drv_freebsd.c
Comment 26 Dexuan Cui 2015-12-18 15:24:19 UTC
Hi Eddy, royger has merged the fix to stable/10.
Comment 27 Eddy 2015-12-18 15:46:45 UTC
(In reply to Dexuan Cui from comment #26)

Thank you Dexuan and royger!
Comment 28 Franco Fichtner 2016-02-04 10:23:57 UTC
Hello,

We've run into this too over at OPNsense. This is a harsh regression from 10.1 to 10.2. It needs an errata for 10.2.


Thank you,
Franco
Comment 29 Eddy 2016-02-04 11:39:55 UTC
Hello everybody,

The issue was fixed with patch r291156. I tested it on a clean FreeBSD install by recompiling the kernel in a test environment and it worked.

It was merged to the STABLE 10 branch (Fri Dec 18 14:56:49 UTC 2015). I assumed that the latest build includes the fix; however, I'm running 10.2-RELEASE-p12 on my production server and the problem still occurs.
Comment 30 Dexuan Cui 2016-02-14 05:33:33 UTC
(In reply to Franco Fichtner from comment #28)
@Franco, thanks for the reminder! We're trying to contact the FreeBSD releasing team and make an errata for 10.2 as you suggested.

(In reply to Eddy from comment #29)
@Eddy, it turns out the fix is only in the Head and stable/10, but not in the releng/10.2. :-(
Now we're working with the FreeBSD releasing team on this too.

BTW, whu has left our team, but we can always be reached at the two email addresses on the page https://wiki.freebsd.org/HyperV.
Comment 31 Kubilay Kocak freebsd_committer freebsd_triage 2016-02-14 07:07:55 UTC
Pending request/response to merge this to the releng/10.2 branch for 10.3-RELEASE
Comment 32 Dexuan Cui 2016-03-17 01:03:02 UTC
(In reply to Dexuan Cui from comment #30)
Hi Franco and Eddy,
The errata is out here:
https://www.freebsd.org/security/advisories/FreeBSD-EN-16:05.hv_netvsc.asc

The fix for releng/10.2 is on the branch:
https://github.com/freebsd/freebsd/commits/releng/10.2
https://github.com/freebsd/freebsd/commit/8938078969b7348e3de72f4bc9377ad163e62abf

So you can update to 10.2 RELEASE-p14 to get the fix.
Comment 33 Franco Fichtner 2016-03-17 06:57:46 UTC
Brilliant, thank you. All looks fine. :)
Comment 34 Dexuan Cui 2016-03-23 12:45:07 UTC
The bug is fixed in 10.2 RELEASE-p14, 10.3 and 11-CURRENT.
The bug doesn't exist in 10.1.
I think we can close the bug now.