| Summary: | Incorrect checksums with NAT on vtnet with offloading | ||
|---|---|---|---|
| Product: | Base System | Reporter: | Jorge Schrauwen <sjorge+signup> |
| Component: | kern | Assignee: | Bryan Venteicher <bryanv> |
| Status: | Open --- | ||
| Severity: | Affects Only Me | CC: | afedorov, contact+freebsd-bug, eugen, freebsd, jeremfg, johan, kp, mike, mikko.tanner, nchevsky, vmaffione |
| Priority: | --- | ||
| Version: | 12.0-STABLE | ||
| Hardware: | amd64 | ||
| OS: | Any | ||
| See Also: | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236309 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=165059 | ||
I was also able to reproduce the bug using ipfw with this config:

    firewall_enable="YES"
    firewall_type="OPEN"
    firewall_logging="YES"
    natd_enable="YES"
    natd_interface="vtnet0"
    natd_flags="-dynamic -m"

(In reply to Jorge Schrauwen from comment #1)
For ipfw nat and/or natd, both based on libalias, this is known and documented in the ipfw(8) manual page: "Due to the architecture of libalias(3), ipfw nat is not compatible with the TCP segmentation offloading (TSO). Thus, to reliably nat your network traffic, please disable TSO on your NICs using ifconfig(8)."

Good to know about ipfw. I was discussing this with kp and he suggested trying it with a different firewall to confirm or rule out pf nat issues.

(In reply to Eugene Grosbein from comment #2)
Right, that makes sense, and I keep forgetting that about ipfw. Does ipf have the same limitation? I'd quite like to work out whether the problem is in vtnet or in pf. There's another report of issues that look similar and where pf doesn't appear to be a factor: https://lists.freebsd.org/pipermail/freebsd-questions/2019-February/284348.html

(In reply to Kristof Provost from comment #4)
I do not know, never used pfnat.

So for ipf it's https://www.freebsd.org/doc/handbook/firewalls-ipf.html right? I'm a bit busy, but with the long holiday weekend I might have a few hours to try and replicate this with ipf.

(In reply to Jorge Schrauwen from comment #6)
Have you had a chance to do the tests you intended?

Oops, I was pretty sure I did update this with the ipf results, but I guess I did not. I could not get ipf to work either; it turns out it was similar to the native firewall on illumos (where I was running the bhyve instance). The illumos version of ipf also has the issue: https://smartos.org/bugview/OS-7924.
Joyent, who are doing the bhyve fork on illumos and did all the offloading work, are going to revert, soonish, the change where loopback traffic would not get checksummed (loopback in the broader sense here: any traffic not hitting the MAC of a physical interface, so inter-guest traffic too). Other software in bhyve guests and native zones is also not dealing properly with this, e.g. VPN servers like wireguard, openvpn, ...: https://smartos.org/bugview/OS-8025

More details on the revert can be found here: https://smartos.org/bugview/OS-8027

So while it looks like ipf, ipfw, and pf do indeed not cope well with traffic that has blank checksums when all the offloading is enabled on the vtnet interface, it's certainly not the only code that has issues with it.

Not 100% sure, but I think you are hitting the same problem as me. If I am not mistaken, you can reproduce a similar problem with a simpler (test) configuration in your fbsd_guest_fw: no layer 3 processing, just a bridge0 to bridge vtnet0 and vtnet1 together, e.g.:

    # ifconfig bridge0 create addm vtnet0 addm vtnet1

In this scenario (with csum enabled) packets coming from vtnet0 with "incorrect checksum" will exit vtnet1 with an incorrect checksum, and will be discarded at the next hop.

The root of the problem is that vtnet cannot really implement a csum offload (since there is no hw that does that); instead it delays the TCP checksum computation, hoping that someone will do it later (if necessary). Unfortunately, the FreeBSD mbuf does not have enough metadata to store and forward the information contained in the per-packet virtio-net header (as defined in the VirtIO standard). This information is already lost when the packet arrives on vtnet1 for transmission, so vtnet1 is not asked to do the offloads (CSUM_TCP is not set), and therefore the packet does not survive the next hop.
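To make the role of the lost virtio-net metadata concrete, here is a minimal Python sketch of what "someone will do that later" amounts to. It assumes the usual virtio-net convention that a packet flagged as needing a checksum carries only the pseudo-header sum in its TCP checksum field, together with `csum_start` and `csum_offset` values that say where to finish the job. The helper names (`build_packet`, `finish_checksum`, `tcp_checksum_ok`) are hypothetical and are not part of vtnet(4) or any FreeBSD API.

```python
import socket
import struct


def csum16(data: bytes) -> int:
    """Folded 16-bit ones' complement sum (not inverted)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(w for (w,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def build_packet(src: str, dst: str, sport: int, dport: int) -> bytes:
    """IPv4 + TCP SYN whose checksum field holds only the pseudo-header sum.

    This mimics (as an assumption) what a guest with TX checksum offload
    hands to the hypervisor. The IPv4 header checksum is left at zero
    because it is irrelevant to the TCP checksum.
    """
    s, d = socket.inet_aton(src), socket.inet_aton(dst)
    tcp = struct.pack("!HHIIBBHHH", sport, dport, 1, 0, 5 << 4, 0x02, 65535, 0, 0)
    partial = csum16(s + d + struct.pack("!BBH", 0, 6, len(tcp)))
    tcp = tcp[:16] + struct.pack("!H", partial) + tcp[18:]
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + len(tcp), 0, 0x4000, 64, 6, 0, s, d)
    return ip + tcp


def tcp_checksum_ok(packet: bytes) -> bool:
    """Verify the TCP checksum the way a receiver would."""
    ihl = (packet[0] & 0x0F) * 4
    tcp = packet[ihl:]
    pseudo = packet[12:20] + struct.pack("!BBH", 0, 6, len(tcp))
    return csum16(pseudo + tcp) == 0xFFFF


def finish_checksum(packet: bytes, csum_start: int, csum_offset: int) -> bytes:
    """Complete a partial checksum using the per-packet offload metadata.

    Sum everything from csum_start to the end (the field already holds the
    pseudo-header sum) and store the complement at csum_start + csum_offset.
    """
    field = csum_start + csum_offset
    value = ~csum16(packet[csum_start:]) & 0xFFFF
    return packet[:field] + struct.pack("!H", value) + packet[field + 2:]


pkt = build_packet("172.28.30.10", "172.28.1.2", 46785, 80)
print(tcp_checksum_ok(pkt))    # False: forwarded as-is, the next hop drops it
fixed = finish_checksum(pkt, csum_start=20, csum_offset=16)
print(tcp_checksum_ok(fixed))  # True: someone finished the checksum
```

Without `csum_start` and `csum_offset` travelling with the packet, the forwarding path has no way to know that `finish_checksum` still needs to run, which is the gap described above.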
See also here: https://reviews.freebsd.org/D21315

About your setup: I see that you have offloads support on vtnet0 and vtnet1 (e.g. TXCSUM, RXCSUM, etc.) --> I guess you are not using bhyve as a hypervisor, because it does not support offloads in 12.0? I guess you are using Linux QEMU/KVM with TAP interfaces backing the vtnet* devices?

(In reply to Vincenzo Maffione from comment #9)
Wow, this is the problem I got caught up in this morning, I think. A bunch of FreeBSD VMs in KVM (Ubuntu 18.x), with a FreeBSD firewall with vtnet0 and vtnet1. vtnet0 has the public IP and vtnet1 connects a bunch of internal VMs. Going out from the gateway VM was not an issue. However, VMs behind the gateway could not establish TCP connections. On a remote host, the SYN packet would get to the host, but the host would never respond with a SYN-ACK... It would just ignore the SYN. When I switched all the devices to em NICs instead of vtnet, everything worked as expected.

A good connection initiated from the gateway across the public IP looked like

    xxx.yyy.135.74.36086 > AAA.BBB.148.55.22: Flags [S], cksum 0xbf90 (correct), seq 489690757, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 3421165670 ecr 0], length 0

and the bad connection, with the vtnet trying to NAT the internal VM, as seen at the external host:

    16:12:39.071314 IP (tos 0x10, ttl 60, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    xxx.yyy.135.74.50993 > AAA.BBB.148.55.22: Flags [S], cksum 0x521b (incorrect -> 0x88ab), seq 515367701, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 898080092 ecr 0], length 0

bad checksum....

(In reply to Vincenzo Maffione from comment #9)
I am using bhyve, but on an illumos host system. On illumos, bhyve does support this offloading. I was talking about this with Tom Jones at FOSDEM and again at EuroBSDcon 2019; the main problem is indeed that FreeBSD's bhyve did not support all the offloading, which made producing a test case that worked on FreeBSD+bhyve rather problematic.
I only run a mix of FreeBSD and illumos (SmartOS), so I am not sure the behavior on Linux/KVM matches that of illumos/bhyve. I do know that FreeBSD/bhyve did not have the problem due to, as you said, the lack of all the offloading features.

(In reply to mike from comment #10)
Yes, that's exactly the problem I'm describing. We need to ask the freebsd-net folks for suggestions, e.g. a way to append more metadata to the mbuf.

(In reply to Jorge Schrauwen from comment #11)
I see. For your information, on freebsd-current bhyve now supports the offloads when the netmap(4) VALE(4) backend is used (instead of TAP).

(In reply to Vincenzo Maffione from comment #13)
That's good to know, which probably means eventually more people will hit the bug.

FreeBSD's vtnet(4) driver, up to version 12.3, implements checksum offload for the transmit path ONLY. For the receive path, it blindly assumes that ALL traffic comes from another virtual machine running on the same hypervisor that already did the checksumming, so it just skips its own checksumming. The solution is to disable the non-working rxcsum "offload" with ifconfig(8):

    ifconfig vtnet0 -rxcsum

In the case of NAT, this solves the problem.

Bryan, please take a look at this PR. Will you merge your improvements to vtnet(4) to stable/12? At least we should have some warning in the manual page vtnet.4 for users that do not use vtnet(4) for inter-VM communication but for NAT to the public network, translating and routing traffic received from vtnet0.

FYI, still an issue on 13.2-RELEASE-p3.

So, rather than just my "+1" post I thought I'd write down my experiments.
While in essence nothing new, it shows that the problem occurs without NAT involved too, and if nothing else it perhaps gives some hints to others with similar issues.
TL;DR:
Whenever a packet is received on a vtnet interface with rxcsum enabled and then forwarded, with or without NAT, it is emitted with an invalid checksum, and any receiver, VM or external, will ignore it.
Disabling rxcsum on the gw's receiving interface solves the issue but with a big performance impact (~25Gbps -> 5Gbps).
In addition, re-enabling rxcsum does NOT restore performance (but
forwarding stops working again), reboot is required.
An identical Linux VM is able to route+NAT traffic at ~20Gbps on the same host.
Setup:
VM Host: single machine, proxmox 8.0.4 with kernel 6.2.16-12-pve, epyc with 16 cores, 128GB, no other load
VM guests: FreeBSD 13.2-RELEASE-p3, 4 cores.
Host Network:
vlan-aware bridge vmbr0 with tap interfaces for each VM
no proxmox firewall
mac filtering turned off on guests
Guests (all 4 cores):
gw:
vtnet0: (mgmt)
vtnet0/vlan10 (172.28.1.213/24)
vtnet0/vlan30 (172.28.30.1/24)
default route: 172.28.1.1
gw-test:
vtnet0 (untagged vlan 30) 172.28.30.10/24
default route: 172.28.30.1
tgt-test:
vtnet0 (untagged vlan 10) 172.28.1.53/24
External hosts:
172.28.1.2 on vlan10
iperf3 between gw and gw-test: ~20-25Gbps in both directions
If gw itself is sending UDP/TCP packets out on vlan10 to any other host in vlan10, virtual or physical remote, it works fine.
If gw-test is sending UDP/TCP packets towards something on 172.28.1.0/24, the traffic goes out on gw-test/vtnet0, arrives on gw/vlan30, and goes out on gw/vlan10.
When received on the remote end, regardless if it's virtual on same host or
other physical machine, it is received with an invalid checksum and is
ignored.
This happens regardless of any PF NAT rules, even if just routed the egress checksums are broken.
For example, tcpdump on receiving end (172.28.1.2) shows tcp/udp checksums are invalid:
21:56:37.426824 02:0a:93:44:78:1b > 00:25:90:f4:df:5c, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
172.28.30.10.46785 > 172.28.1.2.80: Flags [S], cksum 0x7773 (incorrect -> 0x07dd), seq 171480550, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 372923253 ecr 0], length 0
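As a side check on the capture above, the "incorrect" value 0x7773 can be compared against the TCP pseudo-header sum for this flow. The short Python sketch below is only an illustration: `pseudo_header_sum` is a hypothetical helper name, and the TCP length of 40 is derived from the IP total length of 60 shown by tcpdump, assuming a 20-byte IP header without options.

```python
import socket
import struct


def pseudo_header_sum(src: str, dst: str, tcp_length: int, proto: int = 6) -> int:
    """Folded ones' complement sum of the TCP pseudo-header (not inverted)."""
    pseudo = (
        socket.inet_aton(src)
        + socket.inet_aton(dst)
        + struct.pack("!BBH", 0, proto, tcp_length)
    )
    total = sum(w for (w,) in struct.iter_unpack("!H", pseudo))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


captured = 0x7773  # "cksum 0x7773 (incorrect -> 0x07dd)" from the tcpdump line
partial = pseudo_header_sum("172.28.30.10", "172.28.1.2", 60 - 20)
print(hex(partial), partial == captured)  # 0x7773 True
```

The match suggests the checksum field still holds only the partial (pseudo-header) sum that the sending guest left for an offload engine to finish, and that nothing on the forwarding path completed it.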
Setting -txcsum on any interface on any machine does not have any effect.
I had hoped that just disabling txcsum offloading on either machine would make it produce proper packets.
Setting -rxcsum on vtnet0 interface on gw-test does not fix the problem, but performance with gw sending to gw-test is dropped to ~5Gbps. Other direction unaffected.
Re-enabling rxcsum on gw-test does NOT restore performance. Down/up'ing interface does not help. Reboot does help.
Setting -rxcsum on vtnet0 interface on gw, and traffic starts passing fine (with and without pf NAT).
Performance when gw sending to gw-test drops to ~5Gbps. Other direction unaffected.
Re-enabling rxcsum on gw-test will break routed traffic again, but performance
for direct tests is NOT restored.
Have also tried with two vtnet interfaces on gw1, each with a vlan sub-interface (10 on one interface, 30 on the other).
Any combo of rx/tx checksums on interface with vlan10 does not make a difference.
Disabling rxcsum on vtnet with vlan 30 will make it work, just as above.
Also tried with vlan 10 untagged on vtnet1 (no vlan interface), no difference.
Also tried with e1000 interfaces, now everything works but performance is even worse ( <1Gbps iirc)
For comparison, when iperf'ing from gw-test to a VM with linux (6.4.9, ovirt interface in vlan 30):
I can send to it at 25Gbps, but for some reason only receive from it at
14.8Gbps (vs ~25 when receiving from a FreeBSD VM).
Disabling rxcsum on gw-test, I'm down to 5Gbps receive again.
However, when routing & doing iptables masquerading through this linux VM, I can do ~20Gbps in both directions towards tgt-test.
So actual routing is quick, but when iperf3 is running on that machine it is slower.
In reality I don't think I'll have any issues with "just" 5Gbps routing capacity, but it would be nice to not have to waste cpu cycles on rx checksumming when it "should work".
Hopefully this rambling can be useful for someone :)
This is it. This is the bug I have been hitting for two days, and I thought I was going insane.

My setup was fairly simple: one virtualized FreeBSD (14-RELEASE) router and some FreeBSD host VMs. A virtual network connected all the hosts to the router. The router had two NICs, one in the LAN and one for WAN. Easy. The router was using IPFW and in-kernel NAT (TSO disabled on the virtualized router, as per documentation). I re-wrote my ipfw configuration about 10 times thinking I HAD to be doing something wrong.

The craziness: all traffic from the router went where it was supposed to. NAT worked for UDP and ICMP traffic from the LAN. NAT 'looked' like it worked for TCP from the LAN. What do I mean by 'looked' like it worked? Running a packet capture on the router on the WAN interface and the ipfw0 interface showed LAN packets coming in and getting NAT'd to the WAN correctly. Running a packet capture on an upstream host showed that the packets were coming in from the correct source; however, the hosts REFUSED to respond to them, so a SYN/ACK was never sent back.

Finally, looking at a packet capture from some LAN hosts, I saw that the checksum was invalid when trying to make an HTTP(S) connection. Odd. Using traceroute with tcp/udp/icmp protos did not produce a checksum error. I ignored this for a while (I shouldn't have done that, in hindsight) and continued to debug the problem. After hitting nothing but walls and reading forum post after post of people simply screwing up their ipfw config, I searched for 'freebsd nat tcp checksum invalid' and found this bug. This was it. The entire problem.

Based on the notes about TSO in the manual for in-kernel NAT, I would never have guessed that TSO should be disabled on the VM host interfaces and not just the virtual NICs in the router. |
### description

The issue only pops up when there is no valid TCP checksum present on the source traffic; it 'works' when the csum is valid.

I did verify with the illumos people whether a bad checksum will be passed along as-is or not with offloading enabled:

    [17:03:15] <sjorge> rzezeski qq, if a guest with csum enabled sends a checksummed packet does it get discarded and recalculate when it hits the physical nic before it goes on the wire? Or is it kept as is?
    [17:03:26] <sjorge> From my captures it seems to be kept as is?
    [17:05:26] <rzezeski> sjorge: It depends on the guest OS. In illumos, either the IP stack calcs the checksum on Tx or places a flag on the dblk to have it calculated by hardware. If you do both then the hardware will calculate a second sum over the current one and it will be incorrect.
    [17:05:42] <rzezeski> 1) <reply to diff question> 2) I have no idea how FBSD/Linux work in that regard, but they would have the same issue to deal with.

I believe the latter case is what is happening here: pf nat adds a checksum based on the empty initial checksum, and then it gets passed as-is out the NIC. That would also explain why a packet that comes in with a valid checksum will pass back out correctly after pf nat.

    <-> = physical link, aka cat5/cat6 plugged into a NIC or switch
    <~> = loopback link, aka traffic between bhyve guests, between host and bhyve guest, ... stuff that never hits the MAC layer
    <.> = wireless link

This flow is currently broken (only when pf nat is involved):

    fbsd_guest1 <~> fbsd_guest_fw <-> switch(1) <-> modem
    win10_guest <~> fbsd_guest_fw <-> switch(1) <-> modem

This flow is OK:

    macbook <-> switch <->(2) fbsd_guest_fw <-> switch <-> modem
    macbook <.> AP <-> switch <-> fbsd_guest_fw <-> switch <-> modem

1) Using port-mirror on the switch I was able to confirm packets with a bad checksum end up on the wire.
2) The bhyve guest has a vnic on the physical NIC (they are considered a bridge by the host OS).

#### workaround but comes at a performance hit

    root@nattest:~ # ifconfig vtnet0 -rxcsum -txcsum -rxcsum6 -txcsum6 -tso4 -lro
    root@nattest:~ # ifconfig vtnet1 -rxcsum -txcsum -rxcsum6 -txcsum6 -tso4 -lro

Re-enabling these will make the behavior return; this can be done live!

### uname -a

    FreeBSD nattest 12.0-RELEASE FreeBSD 12.0-RELEASE r341666 GENERIC amd64

### ifconfig output

    vtnet0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6c05bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,LRO,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:22:06:05:01:0a
        inet 192.168.0.212 netmask 0xffffff00 broadcast 192.168.0.255
        media: Ethernet 10Gbase-T <full-duplex>
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
    vtnet1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6c05bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,LRO,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:22:06:0a:01:01
        inet 10.23.10.87 netmask 0xffffff00 broadcast 10.23.10.255
        media: Ethernet 10Gbase-T <full-duplex>
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
    lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x3
        inet 127.0.0.1 netmask 0xff000000
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    pflog0: flags=141<UP,RUNNING,PROMISC> metric 0 mtu 33160
        groups: pflog

### /etc/pf.conf

    if_wan="vtnet0"
    if_lan="vtnet1"
    net_lan=$if_lan:network
    scrub on lo0 all random-id
    scrub on vtnet0 all random-id
    scrub on vtnet1 all random-id
    nat on $if_wan inet from $net_lan to any -> ($if_wan) port 1024:65535
    antispoof log for vtnet0
    antispoof log for vtnet1
    block in all
    pass on lo0 all
    pass inet proto icmp all icmp-type { echoreq, unreach } keep state
    pass out all keep state
    pass in on $if_lan proto tcp from $net_lan to any port { 22, 80, 443 }

### /etc/rc.conf

    hostname="nattest"
    gateway_enable="YES"
    ifconfig_vtnet0="DHCP"
    ifconfig_vtnet1="10.23.10.87 netmask 255.255.255.0"
    zfs_enable="YES"
    clear_tmp_enable="YES"
    sshd_enable="YES"
    dumpdev="AUTO"
    pf_enable="YES"
    pflog_enable="YES"
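
The hypothesis in the description, that NAT builds on whatever checksum is already in the packet, can be illustrated with a small Python sketch. It assumes the NAT code adjusts the TCP checksum incrementally in the style of RFC 1624 rather than recomputing it from scratch; that is an assumption for illustration, not a statement about pf's actual code path. The LAN host 10.23.10.20 and destination 192.0.2.1 are made-up addresses, the WAN address is the one from the ifconfig output, and all function names are hypothetical. Source-port rewriting is left out to keep the sketch short.

```python
import socket
import struct


def fold(total: int) -> int:
    """Fold carries back into 16 bits (ones' complement addition)."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def csum16(data: bytes) -> int:
    """Folded 16-bit ones' complement sum (not inverted)."""
    if len(data) % 2:
        data += b"\x00"
    return fold(sum(w for (w,) in struct.iter_unpack("!H", data)))


def pseudo(src: str, dst: str, length: int) -> bytes:
    return socket.inet_aton(src) + socket.inet_aton(dst) + struct.pack("!BBH", 0, 6, length)


def tcp_syn(sport: int, dport: int, cksum: int = 0) -> bytes:
    """Minimal 20-byte TCP SYN header with the given checksum field."""
    return struct.pack("!HHIIBBHHH", sport, dport, 1, 0, 5 << 4, 0x02, 65535, cksum, 0)


def valid(src: str, dst: str, seg: bytes) -> bool:
    return csum16(pseudo(src, dst, len(seg)) + seg) == 0xFFFF


def nat_fixup(cksum: int, old_ip: str, new_ip: str) -> int:
    """RFC 1624 style update: HC' = ~(~HC + ~m + m') for each changed word."""
    total = ~cksum & 0xFFFF
    old = struct.unpack("!HH", socket.inet_aton(old_ip))
    new = struct.unpack("!HH", socket.inet_aton(new_ip))
    for o, n in zip(old, new):
        total += (~o & 0xFFFF) + n
    return ~fold(total) & 0xFFFF


def translate(seg: bytes, old_ip: str, new_ip: str) -> bytes:
    """Rewrite the checksum as a source-address NAT would (ports untouched)."""
    (old_ck,) = struct.unpack("!H", seg[16:18])
    return seg[:16] + struct.pack("!H", nat_fixup(old_ck, old_ip, new_ip)) + seg[18:]


LAN, WAN, DST = "10.23.10.20", "192.168.0.212", "192.0.2.1"
blank = tcp_syn(50000, 443)
full = ~csum16(pseudo(LAN, DST, len(blank)) + blank) & 0xFFFF
partial = csum16(pseudo(LAN, DST, len(blank)))  # what an offloading guest leaves behind

good_out = translate(tcp_syn(50000, 443, full), LAN, WAN)
bad_out = translate(tcp_syn(50000, 443, partial), LAN, WAN)
print(valid(WAN, DST, good_out))  # True: valid in, valid out
print(valid(WAN, DST, bad_out))   # False: the error is carried through NAT
```

An incremental update only accounts for the fields NAT changed, so it preserves whatever error was already present. That would match the observation that traffic arriving with a valid checksum leaves correctly after NAT, while traffic arriving with an unfinished checksum does not.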