Bug 235607

Summary:   Incorrect checksums with NAT on vtnet with offloading
Product:   Base System
Reporter:  Jorge Schrauwen <sjorge+signup>
Component: kern
Assignee:  Bryan Venteicher <bryanv>
Status:    Open
Severity:  Affects Only Me
CC:        afedorov, contact+freebsd-bug, eugen, freebsd, jeremfg, johan, kp, mike, mikko.tanner, nchevsky, vmaffione
Priority:  ---
Version:   12.0-STABLE
Hardware:  amd64
OS:        Any
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236309
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=165059

Description Jorge Schrauwen 2019-02-08 17:08:14 UTC
### description

The issue only pops up when there is no valid TCP checksum present on the source traffic; it 'works' when the checksum is valid. I checked with the illumos people whether a bad checksum is passed along as-is when offloading is enabled:

[17:03:15] <sjorge> rzezeski qq, if a guest with csum enabled sends a checksummed packet does it get discarded and recalculate when it hits the physical nic before it goes on the wire? Or is it kept as is?
[17:03:26] <sjorge> From my captures it seems to be kept as is?
[17:05:26] <rzezeski> sjorge: It depends on the guest OS. In illumos, either the IP stack calcs the checksum on Tx or places a flag on the dblk to have it calculated by hardware. If you do both then the hardware will calculate a second sum over the current one and it will be incorrect.
[17:05:42] <rzezeski> 1) <reply to diff question> 2) I have no idea how FBSD/Linux work in that regard, but they would have the same issue to deal with.

I believe the latter case is what is happening here: pf nat computes the checksum based on the empty initial checksum, and the packet is then passed as-is out the NIC. That would also explain why a packet that comes in with a valid checksum passes back out correctly after pf nat.
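As an aside, pf patches checksums incrementally (RFC 1624 style) when it rewrites addresses and ports rather than recomputing them from scratch, which is consistent with the hypothesis above: an incremental fixup can keep a valid checksum valid, but it can never repair a checksum field that was never computed. A toy sketch of that arithmetic (not pf's actual code; all values are made up):

```python
def fold(s):
    # Fold a sum into 16 bits (ones'-complement carry wraparound).
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return s

def cksum(words):
    # RFC 1071 Internet checksum over 16-bit words.
    return ~fold(sum(words)) & 0xFFFF

def fixup(old_ck, old_word, new_word):
    # RFC 1624 incremental update, as a NAT box applies it when
    # rewriting one 16-bit word (e.g. part of an address or a port).
    return ~fold((~old_ck & 0xFFFF) + (~old_word & 0xFFFF) + new_word) & 0xFFFF

words = [0x0A17, 0x000A, 0x1234]      # toy "packet" contents
good = cksum(words)                   # checksum a non-offloading stack would compute

# NAT rewrites the first word: 0x0A17 -> 0xC0A8
new_words = [0xC0A8, 0x000A, 0x1234]

# A valid incoming checksum stays valid after the incremental fixup...
assert fixup(good, 0x0A17, 0xC0A8) == cksum(new_words)

# ...but a blank checksum (offload deferred, never computed) stays garbage.
assert fixup(0x0000, 0x0A17, 0xC0A8) != cksum(new_words)
```

The fixup only preserves validity, it cannot create it, which matches the observation that packets arriving with a valid checksum leave pf nat correct while packets with a blank checksum do not.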

<-> = physical link, i.e. cat5/cat6 plugged into a NIC or switch
<~> = loopback link, i.e. traffic between bhyve guests, between host and bhyve guest, ... anything that never hits the MAC layer.
<.> = wireless link

These flows are currently broken (only when pf nat is involved):

fbsd_guest1 <~> fbsd_guest_fw <-> switch(1) <-> modem
win10_guest <~> fbsd_guest_fw <-> switch(1) <-> modem

These flows are OK:

macbook <-> switch <->(2) fbsd_guest_fw <-> switch <-> modem
macbook <.> AP <-> switch <-> fbsd_guest_fw <-> switch <-> modem

1) using port-mirroring on the switch I was able to confirm that packets with a bad checksum end up on the wire
2) the bhyve guests have a vnic on the physical NIC (they are considered bridged by the host OS)

#### workaround (comes at a performance hit)
root@nattest:~ # ifconfig vtnet0 -rxcsum -txcsum -rxcsum6 -txcsum6 -tso4 -lro
root@nattest:~ # ifconfig vtnet1 -rxcsum -txcsum -rxcsum6 -txcsum6 -tso4 -lro

Re-enabling these will make the behavior return; this can be done live!


### uname -a
FreeBSD nattest 12.0-RELEASE FreeBSD 12.0-RELEASE r341666 GENERIC amd64

### ifconfig output

vtnet0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6c05bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,LRO,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:22:06:05:01:0a
        inet 192.168.0.212 netmask 0xffffff00 broadcast 192.168.0.255
        media: Ethernet 10Gbase-T <full-duplex>
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
vtnet1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6c05bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,LRO,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:22:06:0a:01:01
        inet 10.23.10.87 netmask 0xffffff00 broadcast 10.23.10.255
        media: Ethernet 10Gbase-T <full-duplex>
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x3
        inet 127.0.0.1 netmask 0xff000000
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
pflog0: flags=141<UP,RUNNING,PROMISC> metric 0 mtu 33160
        groups: pflog


### /etc/pf.conf
if_wan="vtnet0"
if_lan="vtnet1"
net_lan=$if_lan:network

scrub on lo0 all random-id
scrub on vtnet0 all random-id
scrub on vtnet1 all random-id

nat on $if_wan inet from $net_lan to any -> ($if_wan) port 1024:65535

antispoof log for vtnet0
antispoof log for vtnet1

block in all
pass on lo0 all
pass inet proto icmp all icmp-type { echoreq, unreach } keep state
pass out all keep state
pass in on $if_lan proto tcp from $net_lan to any port { 22, 80, 443 }

### /etc/rc.conf

hostname="nattest"
gateway_enable="YES"
ifconfig_vtnet0="DHCP"
ifconfig_vtnet1="10.23.10.87 netmask 255.255.255.0"

zfs_enable="YES"
clear_tmp_enable="YES"
sshd_enable="YES"
dumpdev="AUTO"
pf_enable="YES"
pflog_enable="YES"
Comment 1 Jorge Schrauwen 2019-02-08 17:09:15 UTC
I was also able to reproduce the bug using ipfw with this config

firewall_enable="YES"
firewall_type="OPEN"
firewall_logging="YES"
natd_enable="YES"
natd_interface="vtnet0"
natd_flags="-dynamic -m"
Comment 2 Eugene Grosbein freebsd_committer freebsd_triage 2019-02-08 20:50:57 UTC
(In reply to Jorge Schrauwen from comment #1)

For ipfw nat and/or natd, both based on libalias, this is known and documented in the ipfw(8) manual page:

     Due to the architecture of libalias(3), ipfw nat is not compatible with
     the TCP segmentation offloading (TSO).  Thus, to reliably nat your
     network traffic, please disable TSO on your NICs using ifconfig(8).
Comment 3 Jorge Schrauwen 2019-02-08 21:58:01 UTC
Good to know about ipfw. I was discussing this with kp and he suggested trying it with a different firewall to confirm or rule out pf nat issues.
Comment 4 Kristof Provost freebsd_committer freebsd_triage 2019-02-09 11:04:33 UTC
(In reply to Eugene Grosbein from comment #2)
Right, that makes sense, and I keep forgetting that about ipfw.

Does ipf have the same limitation? I'd quite like to work out whether the problem is in vtnet or in pf.

There's another report of issues that look similar and where pf doesn't appear to be a factor:
https://lists.freebsd.org/pipermail/freebsd-questions/2019-February/284348.html
Comment 5 Eugene Grosbein freebsd_committer freebsd_triage 2019-02-09 11:20:22 UTC
(In reply to Kristof Provost from comment #4)

I do not know, never used pfnat.
Comment 6 Jorge Schrauwen 2019-04-18 18:49:10 UTC
So for ipf it's https://www.freebsd.org/doc/handbook/firewalls-ipf.html right?
I'm a bit busy but with the long holiday weekend I might have a few hours to try and replicate this with ipf.
Comment 7 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-12 06:26:58 UTC
(In reply to Jorge Schrauwen from comment #6)

Have you had a chance to do the tests you intended?
Comment 8 Jorge Schrauwen 2019-11-12 07:05:04 UTC
Oops, I was pretty sure I had updated this with the ipf results, but I guess I did not.

I could not get ipf to work either; it turns out it behaves much like the native firewall on illumos (where I was running the bhyve instance).

It turns out the illumos version of ipf also has the issue: https://smartos.org/bugview/OS-7924.

Joyent, who maintain the bhyve fork on illumos and did all the offloading work, are soon going to revert the change whereby loopback traffic (in the broad sense: any traffic that never hits the MAC of a physical interface, so inter-guest traffic too) would not get checksummed, because other software in bhyve guests and native zones also does not deal properly with this, e.g. VPN servers like WireGuard, OpenVPN, ...
https://smartos.org/bugview/OS-8025

More details on the revert of this can be found here: https://smartos.org/bugview/OS-8027

So while ipf, ipfw, and pf indeed do not cope well with traffic that has blank checksums when all offloading is enabled on the vtnet interface... they are certainly not the only code that has issues with it.
Comment 9 Vincenzo Maffione freebsd_committer freebsd_triage 2019-11-12 21:32:19 UTC
Not 100% sure, but I think you are hitting the same problem as me.
If I am not mistaken, you can reproduce a similar problem with a simpler (test) configuration in your fbsd_guest_fw: no layer 3 processing, just a bridge0 bridging vtnet0 and vtnet1 together, e.g.:

   # ifconfig bridge0 create addm vtnet0 addm vtnet1

In this scenario (with csum enabled) packets coming from vtnet0 with "incorrect checksum" will exit vtnet1 with incorrect checksum, and will be discarded at the next hop.
The root of the problem is that vtnet cannot really implement a csum offload (since there is no hardware to do it); rather, it delays the TCP checksum computation hoping that someone will do it later (if necessary). Unfortunately, the FreeBSD mbuf does not have room to store and forward the metadata contained in the per-packet virtio-net header (as defined in the VirtIO standard). This information is already lost by the time the packet arrives at vtnet1 for transmission, so vtnet1 is not asked to do the offloads (CSUM_TCP is not set), and therefore the packet does not survive the next hop.
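To make the deferred-checksum state concrete, here is a toy sketch (made-up values, not kernel code) of the two states the TCP checksum field can be in: fully computed by the sending stack, versus holding only a pseudo-header-derived partial value while the flag that should tell a later stage to finish the job is lost in forwarding:

```python
def fold(s):
    # Fold a sum into 16 bits (ones'-complement carry wraparound).
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return s

def cksum(words):
    # RFC 1071 Internet checksum over 16-bit words.
    return ~fold(sum(words)) & 0xFFFF

pseudo_hdr = [0xAC1C, 0x1E0A, 0xAC1C, 0x0102, 0x0006, 0x0014]  # toy TCP pseudo-header
tcp_seg    = [0xB6C1, 0x0050, 0x1234, 0x5678, 0x0000, 0x0000]  # checksum field (index 4) zeroed

def verify(seg):
    # Receiver-side check: the sum over pseudo-header plus segment,
    # checksum field included, must fold to 0xFFFF.
    return fold(sum(pseudo_hdr) + sum(seg)) == 0xFFFF

# State 1: the stack computed the checksum itself -- the packet survives.
full = cksum(pseudo_hdr + tcp_seg)
assert verify(tcp_seg[:4] + [full] + tcp_seg[5:])

# State 2: checksum deferred to an offload stage that never ran; the
# field holds only a pseudo-header-derived partial value (the exact
# encoding differs per stack), so the receiver rejects the packet.
partial = cksum(pseudo_hdr)
assert not verify(tcp_seg[:4] + [partial] + tcp_seg[5:])
```

State 2 is exactly the "cksum 0xNNNN (incorrect -> 0xMMMM)" pattern tcpdump prints elsewhere in this thread: the on-wire field does not match the checksum computed over the packet.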

See also here:
https://reviews.freebsd.org/D21315

About your setup: I see that you have offload support on vtnet0 and vtnet1 (e.g. TXCSUM, RXCSUM, etc.), so I guess you are not using bhyve as a hypervisor, because it does not support offloads in 12.0? Are you using Linux QEMU/KVM with TAP interfaces backing the vtnet* devices?
Comment 10 mike 2019-11-12 21:46:54 UTC
(In reply to Vincenzo Maffione from comment #9)

wow, this is the problem I got caught up in this morning, I think. A bunch of FreeBSD VMs in KVM (Ubuntu 18.x): a FreeBSD firewall with vtnet0 and vtnet1, where vtnet0 has the public IP and vtnet1 connects a bunch of internal VMs.

Going out from the gateway VM was not an issue. However, VMs behind the gateway could not establish TCP connections. On a remote host, the SYN packet would arrive, but the host would never respond with a SYN-ACK; it would just ignore the SYN. When I switched all the devices to em NICs instead of vtnet, everything worked as expected.

A good connection initiated from the gateway across the public IP looked like

 xxx.yyy.135.74.36086 > AAA.BBB.148.55.22: Flags [S], cksum 0xbf90 (correct), seq 489690757, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 3421165670 ecr 0], length 0


and the bad connection, with vtnet trying to NAT the internal VM, as seen at the external host:

16:12:39.071314 IP (tos 0x10, ttl 60, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    xxx.yyy.135.74.50993 > AAA.BBB.148.55.22: Flags [S], cksum 0x521b (incorrect -> 0x88ab), seq 515367701, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 898080092 ecr 0], length 0


bad checksum....
Comment 11 Jorge Schrauwen 2019-11-12 22:14:58 UTC
(In reply to Vincenzo Maffione from comment #9)
I am using bhyve, but on an illumos host system. On illumos bhyve does support this offloading.

I was talking about this with Tom Jones at FOSDEM and again at EuroBSDcon 2019; the main problem is indeed that FreeBSD's bhyve did not support all the offloading, which made producing a test case that worked on FreeBSD+bhyve rather problematic.

I only run a mix of FreeBSD and illumos (SmartOS), so I am not sure the behavior on Linux/KVM matches that of illumos/bhyve. I do know that FreeBSD/bhyve did not have the problem due to, as you said, the lack of all the offloading features.
Comment 12 Vincenzo Maffione freebsd_committer freebsd_triage 2019-11-13 18:38:57 UTC
(In reply to mike from comment #10)
Yes, that's exactly the problem I'm describing.
We need to ask the freebsd-net folks for suggestions, e.g. a way to attach more metadata to the mbuf.
Comment 13 Vincenzo Maffione freebsd_committer freebsd_triage 2019-11-13 18:40:17 UTC
(In reply to Jorge Schrauwen from comment #11)
I see.
For your information, on freebsd-current bhyve now supports the offloads when the netmap(4) VALE backend is used (instead of TAP).
Comment 14 Jorge Schrauwen 2019-11-13 18:59:56 UTC
(In reply to Vincenzo Maffione from comment #13)
That's good to know; it probably means more people will eventually hit the bug.
Comment 15 Eugene Grosbein freebsd_committer freebsd_triage 2022-09-10 20:04:28 UTC
FreeBSD's vtnet(4) driver up to version 12.3 implements checksum offload for the transmit path ONLY.

For the receive path, it blindly assumes that ALL traffic comes from another virtual machine running on the same hypervisor that has already done the checksumming, so it skips its own checksum verification. The solution is to disable the non-working rxcsum "offload" with ifconfig(8):

ifconfig vtnet0 -rxcsum

In case of NAT, this solves the problem.
Comment 16 Eugene Grosbein freebsd_committer freebsd_triage 2022-09-10 20:09:44 UTC
Bryan, please take a look at this PR. Will you merge your improvements to vtnet(4) to stable/12? At least we should have some warning in the vtnet(4) manual page for users who do not use vtnet(4) for inter-VM communication, but for NAT to the public network, translating and routing traffic received from vtnet0.
Comment 17 Johan Ström 2023-09-26 07:56:16 UTC
FYI, still an issue on 13.2-RELEASE-p3.
Comment 18 Johan Ström 2023-10-10 20:28:35 UTC
So, rather than just my "+1" post I thought I'd write down my experiments.

While in essence nothing new, it shows that the problem also occurs without NAT involved, and if nothing else it perhaps gives some hints to others with similar issues.

TL;DR:
Whenever a packet is received on a vtnet interface with rxcsum enabled and is then forwarded, with or without NAT, it is emitted with an invalid checksum, and any receiver, VM or external, will ignore it.
Disabling rxcsum on the gw's receiving interface solves the issue, but with a big performance impact (~25Gbps -> 5Gbps).
In addition, re-enabling rxcsum does NOT restore performance (but forwarding stops working again); a reboot is required.
An identical Linux VM is able to route + NAT traffic at ~20Gbps on the same host.

Setup:
VM Host: single machine, proxmox 8.0.4 with kernel 6.2.16-12-pve, epyc with 16 cores, 128GB, no other load
VM guests: FreeBSD 13.2-RELEASE-p3, 4 cores.

Host Network:
   vlan-aware bridge vmbr0 with tap interfaces for each VM
   no proxmox firewall
   mac filtering turned off on guests

Guests (all 4 cores):
   gw:
      vtnet0: (mgmt)
      vtnet0/vlan10 (172.28.1.213/24)
      vtnet0/vlan30 (172.28.30.1/24)
      default route: 172.28.1.1
   gw-test:
      vtnet0 (untagged vlan 30) 172.28.30.10/24
      default route: 172.28.30.1
   tgt-test:
      vtnet0 (untagged vlan 10) 172.28.1.53/24

External hosts:
   172.28.1.2 on vlan10

iperf3 between gw and gw-test: ~20-25Gbps in both directions

If gw itself sends UDP/TCP packets out on vlan10 to any other host in vlan10, virtual or physically remote, it works fine.
If gw-test sends UDP/TCP packets towards something on 172.28.1.0/24, they go out on gw-test/vtnet0, arrive on gw/vlan30, and go out on gw/vlan10. When received on the remote end, regardless of whether that is a VM on the same host or another physical machine, they arrive with an invalid checksum and are ignored.
This happens regardless of any pf NAT rules; even when just routed, the egress checksums are broken.

For example, tcpdump on receiving end (172.28.1.2) shows tcp/udp checksums are invalid:
21:56:37.426824 02:0a:93:44:78:1b > 00:25:90:f4:df:5c, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.28.30.10.46785 > 172.28.1.2.80: Flags [S], cksum 0x7773 (incorrect -> 0x07dd), seq 171480550, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 372923253 ecr 0], length 0

Setting -txcsum on any interface on any machine does not have any effect. I had hoped that disabling txcsum offloading on either machine would make it produce proper packets.

Setting -rxcsum on the vtnet0 interface on gw-test does not fix the problem, but performance when gw sends to gw-test drops to ~5Gbps; the other direction is unaffected.
Re-enabling rxcsum on gw-test does NOT restore performance. Down/up'ing the interface does not help; a reboot does.

Setting -rxcsum on the vtnet0 interface on gw makes traffic start passing fine (with and without pf NAT).
Performance when gw sends to gw-test drops to ~5Gbps; the other direction is unaffected.
Re-enabling rxcsum on gw will break routed traffic again, but performance for the direct tests is NOT restored.

Have also tried with two vtnet interfaces on gw, each with a vlan sub-interface (10 on one interface, 30 on the other).
Any combination of rx/tx checksum settings on the interface with vlan10 does not make a difference.
Disabling rxcsum on the vtnet with vlan30 makes it work, just as above.

Also tried with vlan 10 untagged on vtnet1 (no vlan interface), no difference.

Also tried with e1000 interfaces: now everything works, but performance is even worse (<1Gbps IIRC).


For comparison, when iperf'ing from gw-test to a Linux VM (6.4.9, ovirt interface in vlan 30):
I can send to it at 25Gbps, but for some reason only receive from it at 14.8Gbps (vs ~25Gbps when receiving from the FreeBSD VM).
Disabling rxcsum on gw-test, I'm down to 5Gbps receive again.
However, when routing and doing iptables masquerading through this Linux VM, I can do ~20Gbps in both directions towards tgt-test.
So the actual routing is quick; it is only slower when iperf3 runs on that machine itself.


In reality I don't think I'll have any issues with "just" 5Gbps routing capacity, but it would be nice not to have to waste CPU cycles on rx checksumming when it "should work".

Hopefully this rambling can be useful for someone :)
Comment 19 contact+freebsd-bug 2023-12-01 02:48:48 UTC
This is it. This is the bug I have been hitting for two days and I thought I was going insane.

My setup was fairly simple: one virtualized FreeBSD (14-RELEASE) router and a set of FreeBSD VMs. A virtual network connected all the hosts to the router. The router had two NICs, one for the LAN and one for the WAN. Easy. The router was using ipfw and in-kernel NAT (TSO disabled on the virtualized router, as per the documentation).

I re-wrote my ipfw configuration about 10 times thinking I HAD to be doing something wrong. 

The craziness:

All traffic from the router went where it was supposed to.

NAT worked for UDP and ICMP traffic from the LAN. 
NAT 'looked' like it worked for TCP from the LAN.

What do I mean by 'looked' like it worked?

Running a packet capture on the router on the WAN interface and the ipfw0 interface showed LAN packets coming in and getting NAT'd to the WAN correctly.

Running a packet capture on an upstream host showed that the packets were coming in from the correct source, however, the hosts REFUSED to respond to them so a SYN/ACK was never sent back.

Finally, looking at a packet capture from some LAN hosts, I saw that the checksum was invalid when trying to make an HTTP(S) connection. Odd. Using traceroute with tcp/udp/icmp protocols did not produce a checksum error. I ignored this for a while (I shouldn't have, in hindsight) and continued to debug the problem.

After hitting nothing but walls and reading forum post after post of people simply screwing up their ipfw config, I searched for 'freebsd nat tcp checksum invalid' and found this bug.

This was it. The entire problem. Based on the notes about TSO in the manual for in-kernel NAT, I would never have guessed that TSO should be disabled on the VM host interfaces and not just the virtual NICs in the router.