Bug 235607 - Incorrect checksums with NAT on vtnet with offloading
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-net mailing list
Reported: 2019-02-08 17:08 UTC by Jorge Schrauwen
Modified: 2019-11-19 05:23 UTC
Description Jorge Schrauwen 2019-02-08 17:08:14 UTC
### description

The issue only appears when the source traffic carries no valid TCP checksum; it 'works' when the checksum is valid. I checked with the illumos people whether a bad checksum is passed along as-is when offloading is enabled:

[17:03:15] <sjorge> rzezeski qq, if a guest with csum enabled sends a checksummed packet does it get discarded and recalculate when it hits the physical nic before it goes on the wire? Or is it kept as is?
[17:03:26] <sjorge> From my captures it seems to be kept as is?
[17:05:26] <rzezeski> sjorge: It depends on the guest OS. In illumos, either the IP stack calcs the checksum on Tx or places a flag on the dblk to have it calculated by hardware. If you do both then the hardware will calculate a second sum over the current one and it will be incorrect.
[17:05:42] <rzezeski> 1) <reply to diff question> 2) I have no idea how FBSD/Linux work in that regard, but they would have the same issue to deal with.

I believe the latter case is what is happening here: pf nat updates the checksum starting from the empty initial checksum, and the result is then passed as-is out the NIC. That would also explain why a packet that comes in with a valid checksum goes back out correctly after pf nat.
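
The reasoning above can be illustrated with a small Python sketch. It is a toy model, not pf's code: it assumes the NAT step adjusts the existing checksum incrementally (RFC 1624 style) instead of recomputing it, and the hypothetical `segment_checksum` covers only the source address, destination address, and payload rather than a full TCP pseudo-header. The addresses and payload are illustrative.

```python
import struct


def ones_sum(data: bytes) -> int:
    """16-bit ones' complement sum (RFC 1071), not yet inverted."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def segment_checksum(src: bytes, dst: bytes, payload: bytes) -> int:
    """Hypothetical, simplified checksum over addresses and payload."""
    return ~ones_sum(src + dst + payload) & 0xFFFF


def fixup(cksum: int, old_word: int, new_word: int) -> int:
    """RFC 1624 incremental update: HC' = ~(~HC + ~m + m')."""
    total = (~cksum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def nat_rewrite(cksum: int, old_src: bytes, new_src: bytes) -> int:
    """Adjust a checksum for a rewritten IPv4 source address."""
    for i in (0, 2):
        old = int.from_bytes(old_src[i:i + 2], "big")
        new = int.from_bytes(new_src[i:i + 2], "big")
        cksum = fixup(cksum, old, new)
    return cksum


LAN_SRC = bytes([10, 23, 10, 5])      # illustrative guest address
WAN_SRC = bytes([192, 168, 0, 212])   # address the NAT rewrites to
DST = bytes([198, 51, 100, 7])        # illustrative remote host
PAYLOAD = b"hello world!"

valid = segment_checksum(LAN_SRC, DST, PAYLOAD)
expected = segment_checksum(WAN_SRC, DST, PAYLOAD)

# Starting from a valid checksum, the incremental update stays valid.
print(nat_rewrite(valid, LAN_SRC, WAN_SRC) == expected)
# Starting from a blank (offloaded) checksum field, it does not.
print(nat_rewrite(0, LAN_SRC, WAN_SRC) == expected)
```

In this model an incremental fixup only preserves correctness; it cannot create it. If the field was left blank for a later offload step and nothing completes it afterwards, the value that reaches the wire is wrong.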

<-> = physical link, aka cat5/cat6 plugged into a NIC or Switch
<~> = loopback link, aka traffic between bhyve guests, between host and bhyve guest, ... stuff that never hit the MAC layer.
<.> = wireless link

This flow is currently broken (only when pf nat is involved):

fbsd_guest1 <~> fbsd_guest_fw <-> switch(1) <-> modem
win10_guest <~> fbsd_guest_fw <-> switch(1) <-> modem

This flow is OK:

macbook <-> switch <->(2) fbsd_guest_fw <-> switch <-> modem
macbook <.> AP <-> switch <-> fbsd_guest_fw <-> switch <-> modem

1) using port-mirror on the switch I was able to confirm packets with a bad checksum end up on the wire
2) the bhyve guest has a vnic on the physical nic (the host OS treats them as bridged)

#### workaround (comes at a performance cost)
root@nattest:~ # ifconfig vtnet0 -rxcsum -txcsum -rxcsum6 -txcsum6 -tso4 -lro
root@nattest:~ # ifconfig vtnet1 -rxcsum -txcsum -rxcsum6 -txcsum6 -tso4 -lro

Re-enabling these options makes the behavior return; this can be done live!
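
To check from a script whether the workaround is in effect, something like the following Python sketch could parse `ifconfig` output and report which of the offload capabilities named in the workaround are still enabled per interface. The helper name `enabled_offloads` and the embedded sample are illustrative; the sample is taken from the ifconfig output in this report, and the parsing assumes the `options=...<...>` line format shown there.

```python
import re

# Capability names matching the flags disabled in the workaround above.
OFFLOADS = {"RXCSUM", "TXCSUM", "RXCSUM_IPV6", "TXCSUM_IPV6", "TSO4", "LRO"}


def enabled_offloads(ifconfig_text: str) -> dict:
    """Map interface name -> set of offload capabilities still enabled."""
    result = {}
    iface = None
    for line in ifconfig_text.splitlines():
        header = re.match(r"^(\S+): flags=", line)
        if header:
            iface = header.group(1)
            result[iface] = set()
            continue
        # Only the interface-level "options=" line, not "nd6 options=".
        opts = re.match(r"^\s+options=[0-9a-f]+<([^>]*)>", line)
        if opts and iface is not None:
            result[iface] = OFFLOADS & set(opts.group(1).split(","))
    return result


SAMPLE = """\
vtnet0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6c05bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,LRO,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
pflog0: flags=141<UP,RUNNING,PROMISC> metric 0 mtu 33160
"""

for name, flags in enabled_offloads(SAMPLE).items():
    print(name, sorted(flags))
```

An empty set for vtnet0 and vtnet1 would indicate that the workaround has been applied to both interfaces.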


### uname -a
FreeBSD nattest 12.0-RELEASE FreeBSD 12.0-RELEASE r341666 GENERIC amd64

### ifconfig output

vtnet0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6c05bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,LRO,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:22:06:05:01:0a
        inet 192.168.0.212 netmask 0xffffff00 broadcast 192.168.0.255
        media: Ethernet 10Gbase-T <full-duplex>
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
vtnet1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6c05bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,LRO,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:22:06:0a:01:01
        inet 10.23.10.87 netmask 0xffffff00 broadcast 10.23.10.255
        media: Ethernet 10Gbase-T <full-duplex>
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x3
        inet 127.0.0.1 netmask 0xff000000
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
pflog0: flags=141<UP,RUNNING,PROMISC> metric 0 mtu 33160
        groups: pflog


### /etc/pf.conf
if_wan="vtnet0"
if_lan="vtnet1"
net_lan=$if_lan:network

scrub on lo0 all random-id
scrub on vtnet0 all random-id
scrub on vtnet1 all random-id

nat on $if_wan inet from $net_lan to any -> ($if_wan) port 1024:65535

antispoof log for vtnet0
antispoof log for vtnet1

block in all
pass on lo0 all
pass inet proto icmp all icmp-type { echoreq, unreach } keep state
pass out all keep state
pass in on $if_lan proto tcp from $net_lan to any port { 22, 80, 443 }

### /etc/rc.conf

hostname="nattest"
gateway_enable="YES"
ifconfig_vtnet0="DHCP"
ifconfig_vtnet1="10.23.10.87 netmask 255.255.255.0"

zfs_enable="YES"
clear_tmp_enable="YES"
sshd_enable="YES"
dumpdev="AUTO"
pf_enable="YES"
pflog_enable="YES"
Comment 1 Jorge Schrauwen 2019-02-08 17:09:15 UTC
I was also able to reproduce the bug using ipfw with this config:

firewall_enable="YES"
firewall_type="OPEN"
firewall_logging="YES"
natd_enable="YES"
natd_interface="vtnet0"
natd_flags="-dynamic -m"
Comment 2 Eugene Grosbein freebsd_committer 2019-02-08 20:50:57 UTC
(In reply to Jorge Schrauwen from comment #1)

For ipfw nat and natd, which are both based on libalias, this is known and documented in the ipfw(8) manual page:

     Due to the architecture of libalias(3), ipfw nat is not compatible with
     the TCP segmentation offloading (TSO).  Thus, to reliably nat your
     network traffic, please disable TSO on your NICs using ifconfig(8).
Comment 3 Jorge Schrauwen 2019-02-08 21:58:01 UTC
Good to know about ipfw. I was discussing this with kp and he suggested trying a different firewall to confirm or rule out pf nat issues.
Comment 4 Kristof Provost freebsd_committer 2019-02-09 11:04:33 UTC
(In reply to Eugene Grosbein from comment #2)
Right, that makes sense, and I keep forgetting that about ipfw.

Does ipf have the same limitation? I'd quite like to work out if the problem is in vtnet or in pf.

There's another report of issues that look similar and where pf doesn't appear to be a factor:
https://lists.freebsd.org/pipermail/freebsd-questions/2019-February/284348.html
Comment 5 Eugene Grosbein freebsd_committer 2019-02-09 11:20:22 UTC
(In reply to Kristof Provost from comment #4)

I do not know, never used pfnat.
Comment 6 Jorge Schrauwen 2019-04-18 18:49:10 UTC
So for ipf it's https://www.freebsd.org/doc/handbook/firewalls-ipf.html right?
I'm a bit busy but with the long holiday weekend I might have a few hours to try and replicate this with ipf.
Comment 7 Eugene Grosbein freebsd_committer 2019-11-12 06:26:58 UTC
(In reply to Jorge Schrauwen from comment #6)

Did you have a chance to run the tests you intended?
Comment 8 Jorge Schrauwen 2019-11-12 07:05:04 UTC
Oops, I was pretty sure I had updated this with the ipf results, but I guess I did not.

I could not get ipf to work either, turns out it was similar to the native firewall on illumos (where I was running the bhyve instance).

Turns out the illumos version of ipf also has the issue: https://smartos.org/bugview/OS-7924.

Joyent, who maintain the bhyve fork on illumos and did all the offloading work, are going to revert the change that left loopback traffic un-checksummed fairly soon ("loopback" in the broader sense of any traffic that does not hit the MAC of a physical interface, so inter-guest traffic too). Other software in bhyve guests and native zones also does not deal properly with this, e.g. VPN servers like wireguard, openvpn, ...
https://smartos.org/bugview/OS-8025

More details on the revert of this can be found here: https://smartos.org/bugview/OS-8027

So while it looks like ipf, ipfw, and pf do indeed not cope well with traffic that has blank checksums when all the offloading is enabled on the vtnet interface... it's certainly not the only code that has issues with it.
Comment 9 Vincenzo Maffione freebsd_committer 2019-11-12 21:32:19 UTC
Not 100% sure, but I think you are hitting the same problem as me.
If I am not mistaken, you can reproduce a similar problem with a simpler (test) configuration in your fbsd_guest_fw: no layer 3 processing, but just a bridge0 to bridge vtnet0 and vtnet1 together, e.g.:

   # ifconfig bridge0 create addm vtnet0 addm vtnet1

In this scenario (with csum enabled) packets coming from vtnet0 with "incorrect checksum" will exit vtnet1 with incorrect checksum, and will be discarded at the next hop.
The root of the problem is that vtnet cannot really implement a csum offload (since there is no hw that does that), but it rather delays the TCP checksum computation hoping that someone will do that later (if necessary). Unfortunately, the FreeBSD mbuf does not have enough metadata to store and forward the metadata contained in the per-packet virtio-net header (as defined in the VirtIO standard). This information is already lost when the packet arrives on vtnet1 on transmission, so that vtnet1 is not asked to do the offloads (CSUM_TCP is not set), and therefore the packet does not survive the next hop.
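
The loss of the "checksum still pending" hint described above can be sketched with a toy Python model. This is not FreeBSD code: `Packet`, `forward`, and `transmit` are hypothetical names, `needs_csum` stands in for the per-packet hint carried in the virtio-net header on receive (and for CSUM_TCP being set on transmit), and the checksum covers only the payload for brevity.

```python
import struct
from dataclasses import dataclass, replace


def checksum(data: bytes) -> int:
    """Inverted 16-bit ones' complement sum (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


@dataclass(frozen=True)
class Packet:
    payload: bytes
    cksum: int = 0            # value currently in the checksum field
    needs_csum: bool = False  # hint: checksum was deferred by the sender


def is_valid(pkt: Packet) -> bool:
    return pkt.cksum == checksum(pkt.payload)


def forward(pkt: Packet, keep_hint: bool) -> Packet:
    """Pass the packet from the receiving to the transmitting interface."""
    return replace(pkt, needs_csum=pkt.needs_csum and keep_hint)


def transmit(pkt: Packet) -> Packet:
    """Complete the checksum only if the packet is still marked for it."""
    if pkt.needs_csum:
        return replace(pkt, cksum=checksum(pkt.payload), needs_csum=False)
    return pkt


# Packet as received from a guest that deferred its checksum.
rx = Packet(payload=b"SYN from guest", cksum=0, needs_csum=True)

print(is_valid(transmit(forward(rx, keep_hint=True))))   # hint survives
print(is_valid(transmit(forward(rx, keep_hint=False))))  # hint is lost
```

When the hint is dropped between the two interfaces, nothing on the transmit side knows the checksum is unfinished, so the packet leaves with the placeholder value and is rejected at the next hop.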

See also here:
https://reviews.freebsd.org/D21315

About your setup: I see that you have offload support on vtnet0 and vtnet1 (e.g. TXCSUM, RXCSUM, etc.), so I guess you are not using bhyve as a hypervisor, because it does not support offloads in 12.0? I guess you are using Linux QEMU/KVM with TAP interfaces backing the vtnet* devices?
Comment 10 mike 2019-11-12 21:46:54 UTC
(In reply to Vincenzo Maffione from comment #9)

Wow, I think this is the problem I got caught up in this morning. A bunch of FreeBSD VMs in KVM (Ubuntu 18.x), with a FreeBSD firewall using vtnet0 and vtnet1.
vtnet0 has the public IP and vtnet1 connects a bunch of internal VMs.

Going out from the gateway VM was not an issue. However, VMs behind the gateway could not establish TCP connections.  On a remote host, the SYN packet would get to the host, but the host would never respond with a SYN-ACK... It would just ignore the SYN. When I switched all the devices to em NICs instead of VTNET, everything worked as expected. 

A good connection initiated from the gateway across the public IP looked like

 xxx.yyy.135.74.36086 > AAA.BBB.148.55.22: Flags [S], cksum 0xbf90 (correct), seq 489690757, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 3421165670 ecr 0], length 0


and the bad connection with the vtnet trying to nat the internal VM as seen at the external host

16:12:39.071314 IP (tos 0x10, ttl 60, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    xxx.yyy.135.74.50993 > AAA.BBB.148.55.22: Flags [S], cksum 0x521b (incorrect -> 0x88ab), seq 515367701, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 898080092 ecr 0], length 0


bad checksum....
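
For anyone wanting to spot this in a larger capture, a small Python sketch like the one below could pick out the segments tcpdump flags as having an incorrect checksum. The helper name `bad_checksums` is illustrative, and the regular expression assumes the verbose `cksum 0x... (incorrect -> 0x...)` format seen in the two captures above, which serve as the sample input.

```python
import re

# Matches e.g. "cksum 0xbf90 (correct)" or "cksum 0x521b (incorrect -> 0x88ab)".
CKSUM_RE = re.compile(
    r"cksum (0x[0-9a-f]+) \((correct|incorrect)(?: -> (0x[0-9a-f]+))?\)"
)


def bad_checksums(lines):
    """Yield (seen, expected) for each line tcpdump marked as incorrect."""
    for line in lines:
        match = CKSUM_RE.search(line)
        if match and match.group(2) == "incorrect":
            yield match.group(1), match.group(3)


SAMPLE = [
    "xxx.yyy.135.74.36086 > AAA.BBB.148.55.22: Flags [S], "
    "cksum 0xbf90 (correct), seq 489690757, win 65535, length 0",
    "xxx.yyy.135.74.50993 > AAA.BBB.148.55.22: Flags [S], "
    "cksum 0x521b (incorrect -> 0x88ab), seq 515367701, win 65535, length 0",
]

for seen, expected in bad_checksums(SAMPLE):
    print(f"bad checksum: saw {seen}, expected {expected}")
```

Note that such a check is only meaningful on a capture taken at the receiving host or on a port mirror; on the sending host, offloaded checksums normally show up as incorrect in local captures even when the packet on the wire is fine.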
Comment 11 Jorge Schrauwen 2019-11-12 22:14:58 UTC
(In reply to Vincenzo Maffione from comment #9)
I am using bhyve, but on an illumos host system. On illumos bhyve does support this offloading.

I was talking about this with Tom Jones at FOSDEM and again at EuroBSDcon 2019. The main problem is indeed that FreeBSD's bhyve did not support all the offloading, which made producing a test case that worked on FreeBSD+bhyve rather problematic.

I only run a mix of FreeBSD and illumos (SmartOS), so I am not sure the behavior on Linux/KVM matches that of illumos/bhyve. I do know that FreeBSD/bhyve did not have the problem due to, as you said, the lack of all the offloading features.
Comment 12 Vincenzo Maffione freebsd_committer 2019-11-13 18:38:57 UTC
(In reply to mike from comment #10)
Yes, that's exactly the problem I'm describing.
We need to ask the freebsd-net folks for suggestions, e.g. a way to append more metadata to the mbuf.
Comment 13 Vincenzo Maffione freebsd_committer 2019-11-13 18:40:17 UTC
(In reply to Jorge Schrauwen from comment #11)
I see.
For your information, on freebsd-current bhyve now supports the offloads when the netmap(4) VALE(4) backend is used (instead of TAP).
Comment 14 Jorge Schrauwen 2019-11-13 18:59:56 UTC
(In reply to Vincenzo Maffione from comment #13)
That's good to know, which probably means eventually more people will hit the bug.