Bug 240416

Summary: [pf] Using TCP timestamps results in incorrect checksums for incoming packets from some peers
Product: Base System Reporter: IPTRACE <arkadiusz.majewski>
Component: kernAssignee: Kristof Provost <kp>
Status: Closed FIXED    
Severity: Affects Only Me CC: kp, tuexen
Priority: ---    
Version: 12.0-RELEASE   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
tcpdump - connection from virtual machine to OVH server
none
Established connection from hypervisor to OVH server
none
tcpdump - connection from virtual machine to OVH server without scaling
none
OS (on VM)to OVH smtp
none
Hypervisor and VM captured packets
none
Hypervisor and VM captured packets (not to OVH none

Description IPTRACE 2019-09-08 12:32:43 UTC
Hello, I have an issue with estblishing TCP connection to OVH systems due to enabled windows scaling in FreeBSD on bhyve VMs packet flow.
Please read the thread about my problem.

https://forums.freebsd.org/threads/no-network-connection-between-specific-mail-servers.70271/
Comment 1 IPTRACE 2019-09-08 13:07:32 UTC
Created attachment 207288 [details]
tcpdump - connection from virtual machine to OVH server
Comment 2 IPTRACE 2019-09-08 13:12:26 UTC
Created attachment 207289 [details]
Established connection from hypervisor to OVH server
Comment 3 IPTRACE 2019-09-08 13:20:36 UTC
Created attachment 207290 [details]
tcpdump - connection from virtual machine to OVH server without scaling
Comment 4 Michael Tuexen freebsd_committer freebsd_triage 2020-11-08 22:57:21 UTC
It looks like the incoming packets are dropped due to a checksum error. Is the problem still happening on FreeBSD 12.2?
Comment 5 IPTRACE 2020-12-16 10:23:25 UTC
(In reply to Michael Tuexen from comment #4)

The problem still exists.

----------------------------------------

World (installed) : 12.2-RELEASE-p2
Kernel (installed): 12.2-RELEASE-p1
Kernel (running)  : 12.2-RELEASE-p1

What is the newest version of FreeBSD available?
--------------------
TYPE="FreeBSD"
REVISION="12.2"
BRANCH="RELEASE-p2"
--------------------

user@mx11:~ % telnet mx3.mail.ovh.net 25
Trying 91.121.53.175...
telnet: connect to address 91.121.53.175: Operation timed out
telnet: Unable to connect to remote host

user@mx11:~ % telnet smtp.gmail.com 25
Trying 173.194.221.109...
Connected to smtp.gmail.com.
Escape character is '^]'.
220 smtp.gmail.com ESMTP 26sm176283lfr.60 - gsmtp
Comment 6 Michael Tuexen freebsd_committer freebsd_triage 2020-12-16 11:24:36 UTC
OK. Let's stick to 12.2. Do you experience the problem in a VM or on a physical host? cCan you provide a .pcap file?
Comment 7 IPTRACE 2020-12-16 11:52:53 UTC
(In reply to Michael Tuexen from comment #6)

On VM only. Packets attached.
Comment 8 IPTRACE 2020-12-16 11:54:18 UTC
Created attachment 220617 [details]
OS (on VM)to OVH smtp
Comment 9 Michael Tuexen freebsd_committer freebsd_triage 2020-12-16 12:37:25 UTC
(In reply to IPTRACE from comment #7)
What kind of VM? What is the host OS? Can you also provide a trace on the host OS? If yes, could uou provide traces from the guest and host OS of the same TCP connection?
I'm trying to figure out which component is responsible for the bug...
Comment 10 IPTRACE 2020-12-16 12:45:49 UTC
(In reply to Michael Tuexen from comment #9)

Hypervisor (bhyve) and VM are on FreeBSD 12.2-RELEASE-p2(world)-p1(kernel). 
Please look at the previous captured packets. It should be enough if not, please let me know.

To be clear. I use bridge and tap on hypervisor for guestes and guests use vtnet0 as network interfaces.

Network flow:

VM (vtnet0) -> tap -> bridge -> hypervisor (ext_int) -> OVH
OVH -> hypervisor (ext_int) -> bridge -> tap -> VM (vtnet0)
Comment 11 Michael Tuexen freebsd_committer freebsd_triage 2020-12-16 12:51:59 UTC
(In reply to IPTRACE from comment #8)
Thanks. The problem is that the received TCP packets have a incorrect checksum. This means that the outgoing TCP segments reach the server and the server accepts them and responds.

However, your VM is using a private address. So I guess a NAT is involved. Do you know where the NAT function is running?

BTW: Your VM is using an interface MTU of 9000 bytes. Is that intentional?
Comment 12 Michael Tuexen freebsd_committer freebsd_triage 2020-12-16 12:55:59 UTC
(In reply to IPTRACE from comment #10)
Can you capture files from the vtnet0 interface from within the VM and the ext_int on the host OS of the same failing connection? That would allow to see if the packets received from the OVH server by the host OS have the correct checksum. Which node is performing NAT?
Comment 13 IPTRACE 2020-12-16 13:19:50 UTC
(In reply to Michael Tuexen from comment #11)

NAT is on hypervisor (I use Packet Filter).
# grep binat /etc/pf.conf

binat on $ext_int from $mx11 to any -> $ext_ip

You have to know the issue is only with OVH servers while window scaling is enabled. I do not have such problem with other mail servers.

Yes, I've set MTU to 9000. During investigation, I was changing to 1500 and the problem still occured.
Comment 14 IPTRACE 2020-12-16 13:35:51 UTC
Created attachment 220618 [details]
Hypervisor and VM captured packets
Comment 15 Michael Tuexen freebsd_committer freebsd_triage 2020-12-17 13:29:07 UTC
(In reply to IPTRACE from comment #14)
Thank you very much for the tracefiles.

The packet (SYN-ACK) received from the OVH server by the hypervisor looks OK, it has a correct checksum. This packet is modified (most likely by pf) as follows:
* the destination address is changed (expected and necessary)
* the TCP timestamps are changed (not sure why)
* the checksum is changed, however the correct value is not chosen.

Since disabling window scaling also disabled the use the time stamps, I would guess that the checksum computation fails if TCP timestamps are used.

Could you provide a tracefile from the guest OS and the host OS of a TCP connection setup to a different server, which works but you still have window scaling and time stamps enabled? I would like to understand the difference.

I have no experience with pf, but are you able to disable the "reassemble tcp" feature and retest?
Comment 16 Michael Tuexen freebsd_committer freebsd_triage 2020-12-17 14:33:43 UTC
This bug could be a duplicate of bug #172648.
Comment 17 IPTRACE 2020-12-17 15:00:26 UTC
(In reply to Michael Tuexen from comment #15)

I commented the line in PF and it started to work.

scrub on $ext_if reassemble tcp

So, the workaround as I disable rfc1323.

user# sudo sysctl net.inet.tcp.rfc1323=0
net.inet.tcp.rfc1323: 1 -> 0


1. What is better to disable PF line or RFC1323 ?
2. Why does it happen only for OVH mail servers ?
Comment 18 IPTRACE 2020-12-17 15:17:15 UTC
Created attachment 220665 [details]
Hypervisor and VM captured packets (not to OVH
Comment 19 Michael Tuexen freebsd_committer freebsd_triage 2020-12-18 12:28:07 UTC
(In reply to IPTRACE from comment #17)

Regarding your first question:
Based on the documentation of pf, reassemble tcp does:
(a) ensure TTL doesn't shrink
(b) randomize TCP timestamps
(c) perform PAWS checks

The FreeBSD TCP does (c) for a long time. I added support for (b) recently, so it is in FreeBSD 12.2. So by disabling the feature in pf, you loose (a).
Disabling window scaling and timestamps can impact the performance of the TCP connections.
Personally, I would disable the line in pf, but you know the tradeoff.

Regarding your second question:
Thanks a lot for the tracefiles, I think I figured out what the problem is.

When pf performs the packet modifications, it has also to modify the TCP checksum accordingly.

The value which is changed in the SYN segment is the TS.val. FreeBSD sends the following TCP options: MSS (4 bytes), NOP (1 byte), WS (3 bytes), SACKOK (2 bytes), and TS (10 bytes). This results in an offset for TS.val of 12. This works.

The values which are changed in the SYN-ACK segment are the TS.val and TS.ecr values.
In the good case the peer responds with the following TCP options: MSS (4 bytes), SACKOK (2 bytes), TS (10 bytes), NOP (1 byte), and WS (3 bytes). This results in an offset for TS.val of 8 and TS.ecr of 12.
In the bad case the peer responds with the following TCP options: MSS (4 bytes), SACKOK (2 bytes), WS (3 bytes), TS (10 bytes), and EOL (1 byte). This results in an offset for TS.val of 11 and TS.ecr of 15.

The values are always modified in the correct place, only the checksum computation is wrong. My understanding of the checksum update code in pf is that it requires that the changed values are at an even offset. That is why it is breaking with the OVH server.

Just to be crystal clear: TCP options can start at any offset, so this is not a bug in the OVH server, it is definitely a bug in the pf code.
Comment 20 IPTRACE 2020-12-18 12:47:05 UTC
It's clear. Thank you for explanation.

In conclusion:
1. Does the error exist in PF directly in OpenBSD ? I mean the newset stable version of OpenBSD 6.8 (https://www.openbsd.org/68.html). If you know, of course.
2. Is it possible to fix the bug in this PF version on FreeBSD ?
.
.
X. Is it possible to port/fork/recompile Packet Filter from OpenBSD to FreeBSD ?

Regarding the last question, I think FreeBSB community need it for a dozen of years.
I read about porting PF from OpenBSD to FreeBSD a few years ago. There is discussion in community about this matter but I cannot find it now. If I find I insert the link.

Perhaps, it's a chance to adapt the newset PF in FreeBSD 13.0-RELEASE ?
Comment 22 Michael Tuexen freebsd_committer freebsd_triage 2020-12-18 13:03:36 UTC
For looking at the source code, I think the current code in OpenBSD handles this correctly:

/* modulate TS */
if (tsval && src->scrub &&
    (src->scrub->pfss_flags & PFSS_TIMESTAMP)) {
        /* tsval used further on */
        tsval = ntohl(tsval);
        pf_patch_32_unaligned(pd,
            ts, htonl(tsval + src->scrub->pfss_ts_mod),
            PF_ALGNMNT(ts - opts));
        copyback = 1;
}

I looked at this PR, because I work in the TCP area and wanted to make sure that this is not related to a bug in the TCP code. I think I'm done with this. I pinged kp@, who is working on the pf code. I don't know if there are any plans to update the pf code.

I could bring in the code from OpenBSD specific to fix this issue, but I'm not sure if the patch would be accepted by the people actually working on pf in FreeBSD.
Comment 23 Kristof Provost freebsd_committer freebsd_triage 2020-12-20 15:03:22 UTC
Thanks for the report and the excellent analysis.
I believe your assessment is correct.

I've managed to reproduce the problem in a relatively small test case: https://reviews.freebsd.org/D27688 and am experimenting with importing the unaligned checksum update code from OpenBSD.
Comment 24 Kristof Provost freebsd_committer freebsd_triage 2020-12-21 10:14:19 UTC
(In reply to Kristof Provost from comment #23)
Proposed fix here: https://reviews.freebsd.org/D27696

I'm pretty sure that you'd only see this issue with certain network hardware. Most MACs will just completely recompute checksums and hide this bug. Previous checksum issues manifested mostly on Xen systems.
Comment 25 Michael Tuexen freebsd_committer freebsd_triage 2020-12-21 10:40:37 UTC
(In reply to Kristof Provost from comment #24)

For incoming packets, NICs at most would try to validate the checksum and figure out that the checksum is wrong. This is the scenario we are looking into.

For outgoing packets, for the NICs I know, some precomputation by the software is needed, which isn't done by the code we are looking at. I also don't see `CSUM_TCP` or `CSUM_TCP_IPV6` being set.
Comment 26 Kristof Provost freebsd_committer freebsd_triage 2020-12-21 10:44:50 UTC
(In reply to Michael Tuexen from comment #25)
> For incoming packets, NICs at most would try to validate the checksum and figure out that the checksum is wrong. This is the scenario we are looking into.

Is it? My test case assume outbound corruption. I've not checked it, but I'd expect our stack to just ignore checksums if the hardware has verified them. So I wouldn't expect this issue to manifest in inbound traffic.

> For outgoing packets, for the NICs I know, some precomputation by the software is needed, which isn't done by the code we are looking at. I also don't see `CSUM_TCP` or `CSUM_TCP_IPV6` being set.

That gets complicated. My recollection from the Xen checksum issue is that it depends on the hardware. Some hardware has no checksum support at all (although that's rare these days), some computes it fully independently (and we wouldn't see this issue there, because the incorrect checksum will be fixed up) and some expects a partial 'pseudo header' checksum to be precomputed and updates the checksum based on the actual payload. That style of hardware would be affected as well.

Anyway, it's clear that the pf checksum update code is incorrect in some cases, and we should fix that, regardless of what hardware does later.
Comment 27 Michael Tuexen freebsd_committer freebsd_triage 2020-12-21 11:10:49 UTC
(In reply to Kristof Provost from comment #26)
> > For incoming packets, NICs at most would try to validate the checksum and figure out that the checksum is wrong. This is the scenario we are looking into.


>Is it? My test case assume outbound corruption. I've not checked it, but I'd expect >our stack to just ignore checksums if the hardware has verified them. So I wouldn't >expect this issue to manifest in inbound traffic.

Just to be precise: The host OS (which is running pf, I guess) receives the SYN-ACK with a correct checksum. Then pf processes it and this results in an incorrect checksum. Then the packet is provided to the VM. My understanding is that the receive side in the VM is dropping the packet. This is why I would say that it is incoming traffic. This could be validated by looking at the output of netstat when running in the VM: netstat -s -ptcp
Looking at the line "discarded for bad checksums" should give this number.

Are you saying the m_pkthdr.csum_flags of the packet received by the host OS are the same of the packet received by the guest OS?

> >For outgoing packets, for the NICs I know, some precomputation by the software is needed, which isn't done by the code we are looking at. I also don't see `CSUM_TCP` or `CSUM_TCP_IPV6` being set.


>That gets complicated. My recollection from the Xen checksum issue is that it depends >on the hardware. Some hardware has no checksum support at all (although that's rare >these days), some computes it fully independently (and we wouldn't see this issue >there, because the incorrect checksum will be fixed up) and some expects a partial >'pseudo header' checksum to be precomputed and updates the checksum based on the >actual payload. That style of hardware would be affected as well.

I agree. I was referring to the pseudo header...

>Anyway, it's clear that the pf checksum update code is incorrect in some cases, and we should fix that, regardless of what hardware does later.

I completely agree. Thanks a lot for fixing it!
Comment 28 Kristof Provost freebsd_committer freebsd_triage 2020-12-21 11:13:16 UTC
(In reply to Michael Tuexen from comment #27)
> Just to be precise: The host OS (which is running pf, I guess) receives the SYN-ACK with a correct checksum. Then pf processes it and this results in an incorrect checksum. Then the packet is provided to the VM. My understanding is that the receive side in the VM is dropping the packet. This is why I would say that it is incoming traffic. 

Oh, I see. I'd call that outgoing traffic, at least from the perspective of pf on the host machine. That explains our "disagreement". I think we're on the same page.
Comment 30 commit-hook freebsd_committer freebsd_triage 2021-01-03 22:19:47 UTC
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=5c712c8748f304f57e18b09f890b46d5f3d13a2e

commit 5c712c8748f304f57e18b09f890b46d5f3d13a2e
Author:     Kristof Provost <kp@FreeBSD.org>
AuthorDate: 2020-12-19 15:06:03 +0000
Commit:     Kristof Provost <kp@FreeBSD.org>
CommitDate: 2021-01-03 22:11:08 +0000

    pf tests: Verify (tcp) checksum modification on unaligned options

    It turns out pf incorrectly updates the TCP checksum if the TCP option
    we're modifying is not 2-byte algined with respect to the start of the
    packet.

    Create a TCP packet with such an option and throw it through a scrub
    rule, which will update timestamps and modify the packet.

    PR:             240416
    MFC after:      1 week
    Differential revision:  https://reviews.freebsd.org/D27688

    (cherry picked from commit 2d3fda5fa1dc99aa8788e5f8d8bb71e682101063)

 tests/sys/netpfil/common/pft_ping.py   | 69 +++++++++++++++++++++++++--
 tests/sys/netpfil/pf/Makefile          |  1 +
 tests/sys/netpfil/pf/checksum.sh (new) | 85 ++++++++++++++++++++++++++++++++++
 3 files changed, 151 insertions(+), 4 deletions(-)
Comment 31 commit-hook freebsd_committer freebsd_triage 2021-01-03 22:20:23 UTC
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=21745738a2b5662dfbe730b6338aa38e829cb0eb

commit 21745738a2b5662dfbe730b6338aa38e829cb0eb
Author:     Kristof Provost <kp@FreeBSD.org>
AuthorDate: 2020-12-20 20:06:32 +0000
Commit:     Kristof Provost <kp@FreeBSD.org>
CommitDate: 2021-01-03 22:11:08 +0000

    pf: Fix unaligned checksum updates

    The algorithm we use to update checksums only works correctly if the
    updated data is aligned on 16-bit boundaries (relative to the start of
    the packet).

    Import the OpenBSD fix for this issue.

    PR:             240416
    Obtained from:  OpenBSD
    MFC after:      1 week
    Reviewed by:    tuexen (previous version)
    Differential Revision:  https://reviews.freebsd.org/D27696

    (cherry picked from commit c3f69af03ae7acc167cc1151f0c1ecc5e014ce4e)

 sys/net/pfvar.h          |  5 +++
 sys/netpfil/pf/pf.c      | 81 +++++++++++++++++++++++++++++++++++++++---------
 sys/netpfil/pf/pf_norm.c | 23 ++++++++++----
 3 files changed, 89 insertions(+), 20 deletions(-)