Bug 280036

Summary: Data corruption over if_ovpn (OpenVPN DCO) observed
Product: Base System Reporter: ss3bsd <3226388001>
Component: kernAssignee: freebsd-net (Nobody) <net>
Status: Closed FIXED    
Severity: Affects Some People CC: ae, franco, kib, kp, markj, mike, zarychtam, zlei
Priority: ---    
Version: 14.1-RELEASE   
Hardware: Any   
OS: Any   

Description ss3bsd 2024-06-28 12:31:50 UTC
Steps to Reproduce:

- Run two instances (Host A and Host B) of FreeBSD using VMware player on Windows

 Image used: https://download.freebsd.org/releases/VM-IMAGES/14.1-RELEASE/amd64/Latest/FreeBSD-14.1-RELEASE-amd64.vmdk.xz

  Note: The VM environment (VMware player) is not relevant because this issue is first found on a physical machine.


- Do basic network config of the hosts (ip address, default gateway etc.)


- Install OpenVPN at each host

 # pkg install openvpn


- Create openvpn config files (they are derivations of /usr/tests/sys/net/if_ovpn/if_ovpn)

 <Host A openvpn config>

 dev ovpn0
 dev-type tun
 proto udp4
 
 cipher AES-256-GCM
 auth SHA256
 
 local 192.168.XXX.YYY
 server 198.51.100.0 255.255.255.0
 ca /usr/tests/sys/net/if_ovpn/ca.crt
 cert /usr/tests/sys/net/if_ovpn/server.crt
 key /usr/tests/sys/net/if_ovpn/server.key
 dh /usr/tests/sys/net/if_ovpn/dh.pem
 
 mode server
 script-security 2
 auth-user-pass-verify /usr/bin/true via-env
 topology subnet
 
 keepalive 100 600


 <Host B openvpn config>

 dev tun0
 dev-type tun
 
 client
 
 remote 192.168.XXX.YYY
 auth-user-pass /usr/tests/sys/net/if_ovpn/user.pass
 
 ca /usr/tests/sys/net/if_ovpn/ca.crt
 cert /usr/tests/sys/net/if_ovpn/client.crt
 key /usr/tests/sys/net/if_ovpn/client.key
 dh /usr/tests/sys/net/if_ovpn/dh.pem
 
 keepalive 100 600
 

- Prepare Host A

  ** adduser (USERNAME)
  
  ** edit /etc/inetd.conf and enable ftpd.
  
    # /etc/rc.d/inetd onestart
    Starting inetd.
  
    Note: this issue is not specific to ftp. This is just for a simple test.
  
  ** start openvpn

    # openvpn --config HOST_A_CONFIG_FILE



- Test at Host B

  ** start openvpn

    # openvpn --config HOST_B_CONFIG_FILE &

  ** prepare test data

    # dd if=/dev/random bs=1M count=100 of=randomfile

  ** transfer data over if_ovpn (put and get back)

    # ftp 198.51.100.1
    Connected to 198.51.100.1.
    220 freebsd FTP server (Version 6.00LS) ready.
    Name (198.51.100.1:root): USERNAME
    331 Password required for USERNAME.
    Password:
    230 User USERNAME logged in.
    Remote system type is UNIX.
    Using binary mode to transfer files.
    
    ftp> put randomfile
    local: randomfile remote: randomfile
    229 Entering Extended Passive Mode (|||50636|)
    150 Opening BINARY mode data connection for 'randomfile'.
    100% |********************************************************************************************************|   100 MiB   22.87 MiB/s    00:00 ETA
    226 Transfer complete.
    104857600 bytes sent in 00:04 (22.80 MiB/s)
    
    ftp> get randomfile randomfile.returned
    local: randomfile.returned remote: randomfile
    229 Entering Extended Passive Mode (|||58633|)
    150 Opening BINARY mode data connection for 'randomfile' (104857600 bytes).
    100% |********************************************************************************************************|   100 MiB   25.26 MiB/s    00:00 ETA
    226 Transfer complete.
    104857600 bytes received in 00:03 (25.26 MiB/s)
    
    ftp> bye
    221 Goodbye.

  ** compare the files

    # diff randomfile randomfile.returned
    Binary files randomfile and randomfile.returned differ
    
    # ll randomfile*
    -rw-r--r--  1 root wheel 104857600 Jun 28 20:18 randomfile
    -rw-r--r--  1 root wheel 104857600 Jun 28 20:19 randomfile.returned
    
    # md5sum randomfile*
    8008cf7f76ea6b1b3f8a85030f226ec9  randomfile
    f2c09d5bf4891e82bd38d8af7c2775b7  randomfile.returned
    
    Note: The larger the file, the higher the chance of data corruption.
Comment 1 Andrey V. Elsukov freebsd_committer freebsd_triage 2024-07-01 11:45:36 UTC
Did you run the same test with --disable-dco option?
Comment 2 ss3bsd 2024-07-01 12:20:12 UTC
(In reply to Andrey V. Elsukov from comment #1)

With a different config at a physical machine, yes, and the issue did not occur if disable-dco is set.

 # openvpn --config CONFIG --disable-dco


* Perhaps the sample configs I showed in the reproduce step need to be modified to use the disable-dco option since the option is not compatible with dev ovpn0.
Comment 3 ss3bsd 2024-07-21 10:46:40 UTC
Has anyone else reproduced this?

Since I have already decided not to use if_ovpn, this additional comment below will be the last one unless someone can reproduce the issue.

--

The issue has been confirmed using Hyper-V (10.0.22621 [SP3]) too.
Also, another issue (server-side file corrption during FTP GET over if_oven) is observed there.

[procedure #2] 
  Test condition is almost same as the already reported one.
  Smaller test file is enough this time, but GET multiple times instead.
  
  ** prepare test data
   # dd if=/dev/random bs=1M count=4 of=randomfile
  
  
  ** transfer data over if_ovpn (put and get back)
  
  ftp> put randomfile
  local: randomfile remote: randomfile
  229 Entering Extended Passive Mode (|||56921|)
  150 Opening BINARY mode data connection for 'randomfile'.
  100% |********************************************************************************************************|  4096 KiB  107.39 MiB/s    00:00 ETA
  226 Transfer complete.
  4194304 bytes sent in 00:00 (104.75 MiB/s)
  
  
  ftp> get randomfile randomfile.returned1
  local: randomfile.returned1 remote: randomfile
  229 Entering Extended Passive Mode (|||52890|)
  150 Opening BINARY mode data connection for 'randomfile' (4194304 bytes).
  100% |********************************************************************************************************|  4096 KiB  106.43 MiB/s    00:00 ETA
  226 Transfer complete.
  4194304 bytes received in 00:00 (106.16 MiB/s)
  
  ftp> get randomfile randomfile.returned2
  local: randomfile.returned2 remote: randomfile
  229 Entering Extended Passive Mode (|||51626|)
  150 Opening BINARY mode data connection for 'randomfile' (4194304 bytes).
  100% |********************************************************************************************************|  4096 KiB  112.00 MiB/s    00:00 ETA
  226 Transfer complete.
  4194304 bytes received in 00:00 (111.72 MiB/s)
  
  ftp> get randomfile randomfile.returned3
  local: randomfile.returned3 remote: randomfile
  229 Entering Extended Passive Mode (|||50186|)
  150 Opening BINARY mode data connection for 'randomfile' (4194304 bytes).
  100% |********************************************************************************************************|  4096 KiB  116.54 MiB/s    00:00 ETA
  226 Transfer complete.
  4194304 bytes received in 00:00 (116.23 MiB/s)
  
  ftp> get randomfile randomfile.returned4
  local: randomfile.returned4 remote: randomfile
  229 Entering Extended Passive Mode (|||56341|)
  150 Opening BINARY mode data connection for 'randomfile' (4194304 bytes).
  100% |********************************************************************************************************|  4096 KiB   96.91 MiB/s    00:00 ETA
  226 Transfer complete.
  4194304 bytes received in 00:00 (96.68 MiB/s)
  
  ftp> bye
  221 Goodbye.
  
  # md5sum randomfile*
  76b3ea410e4d56b9b684116a2a8319cb  randomfile
  76b3ea410e4d56b9b684116a2a8319cb  randomfile.returned1  -- file is small, so corruption at the network did not occur this time
  13f1d329379862cd1e157eeb55c6a436  randomfile.returned2  -- the server-side file has been destroyed by the previous GET (*)
  4d3846965ad68016d545ae4222a33ed5  randomfile.returned3  -- same as above
  ebfc8ff8ee13841599562a8a211c5b31  randomfile.returned4  -- same as above

   (*) example of the file corruption by GET command.
     -- At the ftp server side --
     * before "get randomfile randomfile.returned4"
       # md5sum randomfile
       ebfc8ff8ee13841599562a8a211c5b31  randomfile  -- matches to randomfile.returned4 at the client side later

     * after "get randomfile randomfile.returned4"
       # md5sum randomfile
       5a49c61306c82a4c57768876b95c4ae4  randomfile  -- destroyed
Comment 4 Zhenlei Huang freebsd_committer freebsd_triage 2024-08-30 09:11:41 UTC
(In reply to ss3bsd from comment #3)
I guess a small file ( 1M bytes ) is enough to reproduce. Maybe require multiple rounds though.

> # md5sum randomfile*
This is good to check the corruption, but less useful to figure out how many bytes got corrupted.

May you please also share the binary diff ? You can do that by:
```
# diff <( hexdump -x randomfile ) <( hexdump -x randomfile.return )
```
Comment 5 Marek Zarychta 2024-08-30 21:14:45 UTC
Setting "tun-mtu 1400" on the OpenVPN access server should work as a workaround. Please see bug 276838 for more details.
Comment 6 Marek Zarychta 2024-08-30 21:19:38 UTC
FWIW: there is another mild flaw with the FreeBSD implementation of DCO: the ovpn(4) module doesn't support "multihome" option (more details in bug 273664).
Comment 7 Marek Zarychta 2024-08-30 22:00:13 UTC
Please let me apologise since I hastily connected this bug with another in comment #5 misleading the audience. These bugs seem to be unrelated to each other.

To not hijack this PR but to give more insight into the problem let me add some details:
The bug described here is reproducible, but to narrow the scope I observed that only data sent directly from OpenVPN DCO accelerated FreeBSD-driven access server to the FreeBSD client was corrupted. The corruption doesn't affect network packets transferred (routed) via OpenVPN DCO accelerated FreeBSD access server.
Comment 8 ss3bsd 2024-08-31 04:52:28 UTC
(In reply to Zhenlei Huang from comment #4)

I guess the file is entirely corrupted.

---

# md5sum randomfile*
4d1b051f8bdea93ddca5d8831a0a829c  randomfile
4d1b051f8bdea93ddca5d8831a0a829c  randomfile.returned1
dd8e4c1a9839976c9e812456ecf71b3a  randomfile.returned2
de2d31f956a53a0a86e64058886eb1e9  randomfile.returned3
9432cd3c1ec5540e5251f657e18b93a8  randomfile.returned4


# hexdump -x randomfile > randomfile.hex
# hexdump -x randomfile.returned2 > randomfile.returned2.hex

# head randomfile.hex
0000000    0595    a291    c861    c7ee    c67a    59ab    73ca    66e3
0000010    29ca    9d14    d874    fe71    2fe2    76ae    93b0    cb1a
0000020    df5b    a72f    2256    1e88    337d    d9a8    4a12    0923
0000030    5a72    56f6    64f8    647a    7a2c    a33b    5f0b    87ec
0000040    e86e    fe3e    62a3    e1f4    992a    3fc1    7535    c273
0000050    40f1    ce42    6895    cf20    e272    74b5    485a    8f9b
0000060    584e    3431    78bd    f48c    efd4    cfc8    cd19    8ca7
0000070    cf09    78b2    a189    3d08    4c4e    3acb    057d    3a20
0000080    06d9    2c22    a667    8b38    0aaa    7176    c86f    71b3
0000090    9dd8    6128    4a08    2475    674d    f801    daf7    4fe9

# head randomfile.returned2.hex
0000000    9ae9    866b    3498    82ee    aef9    abeb    af8a    6ae4
0000010    ce04    3671    301d    af4e    fab0    f01d    890a    a143
0000020    8635    7c19    982f    0dcf    34e3    c09e    9992    54ad
0000030    1a97    2a3b    78c4    b3a9    3230    3a67    3649    5f97
0000040    5944    7980    adde    5f02    1e35    73f0    4b7d    32ee
0000050    bd6d    514a    4065    9817    fe0d    6d74    7736    626e
0000060    ef27    d331    7f8b    2ebb    5568    9b92    7264    7846
0000070    606a    0132    d1c9    1a3e    230e    4061    3897    c873
0000080    78ea    e2f7    6f54    38c2    ceb9    413d    557f    20ea
0000090    3ac1    3367    7f3b    01bf    fc96    75bb    549a    5b3e

# diff randomfile.hex randomfile.returned2.hex > randomfile.hex.diff

# wc *.hex*
  262145 2359297 18874376 randomfile.hex
  524290 5242882 38797334 randomfile.hex.diff
  262145 2359297 18874376 randomfile.returned2.hex
 1048580 9961476 76546086 total
Comment 9 ss3bsd 2024-08-31 05:19:29 UTC
(In reply to Marek Zarychta from comment #7)

Hi.

I had checked your PR before reporting this issue.
Reducing MTU did not solve this, so I agree that your PR and mine are separate issues.

Thanks.
Comment 10 Marek Zarychta 2024-09-03 11:13:37 UTC
While investigating the bug, I found that when ftpd(8) is serving a file from the OpenVPN access server itself (from the ZFS filesystem in my case), the file silently (apparently!) gets corrupted on that SERVER! To reproduce, connect from another machine to the FTP server (running on OpenVPN access server) offering read access to the file on the ZFS dataset and download the file a few times, checking if the checksum of the file served remains the same (see the experiment below). In my case, the checksum of the file on the server changed with each file access via ftpd(8), but only if it was served from the OpenVPN access server and was accessed via the OpenVPN tunnel. When the file was accessed by OpenVPN tunnel but, terminated on another access server (also running FreeBSD and using OpenVPN DCO module!), everything was fine. Moreover, the file never got corrupted when the connection was using a Wireguard tunnel terminated on the same machine or the connection was established without any tunnelling.

1. CLient PC:
[homepc] /tmp% lftp ftp://somehost
lftp somehost:~> set xfer:clobber yes
lftp somehost:~> get file
81362184 bajty przesłane w ciągu 3 sekund (22.60 MiB/s)
lftp somehost:/> get file
81362184 bajty przesłane w ciągu 3 sekund (23.17 MiB/s)
lftp somehost:/> get file
81362184 bajty przesłane w ciągu 3 sekund (23.17 MiB/s)
lftp somehost:/>
(...)

2. FreeBSD OpenVPN access server and ftpd(8) server:
[somehost] ~ftp# stat file ; sha1sum file
0:13:54 2019" 131072 152537 0x800 file
589ca9781368ffd66f35394f3b7bf43ab9d98b33  file
[somehost] ~ftp# stat file ; sha1sum file
11075423362389754006 262491 -rw-rw-r-- 2 root wheel 0 81362184 "sty  1 01:00:00 1970" "lip  8 10:13:54 2019" "sie 31 00:40:13 2024" "lip  8 1
0:13:54 2019" 131072 152537 0x800 file
05ba9b9a60df861887dcc690c1237998c1844180  file
[somehost] ~ftp# stat file ; sha1sum file
11075423362389754006 262491 -rw-rw-r-- 2 root wheel 0 81362184 "sty  1 01:00:00 1970" "lip  8 10:13:54 2019" "sie 31 00:40:13 2024" "lip  8 1
0:13:54 2019" 131072 152537 0x800 file
d259dbd577c17b5ad7da0d1d0fa67f5564a68e72  file
[somehost] ~ftp# zpool scrub zroot
[somehost] ~ftp# zpool status
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 00:02:32 with 0 errors on Tue Sep  3 12:37:03 2024
config:
 
      NAME             STATE     READ WRITE CKSUM
      zroot            ONLINE       0     0     0
        gpt/ssdrootfs  ONLINE       0     0     0
 
errors: No known data errors
[somehost] ~ftp# stat file ; sha1sum file
11075423362389754006 262491 -rw-rw-r-- 2 root wheel 0 81362184 "sty  1 01:00:00 1970" "lip  8 10:13:54 2019" "sie 31 00:40:13 2024" "lip  8 1
0:13:54 2019" 131072 152537 0x800 file
fdf68d23439f5a269850e71e1807ea8889a8ffd0  file
[somehost] ~ftp# stat file ; sha1sum file
11075423362389754006 262491 -rw-rw-r-- 2 root wheel 0 81362184 "sty  1 01:00:00 1970" "lip  8 10:13:54 2019" "sie 31 00:40:13 2024" "lip  8 1
0:13:54 2019" 131072 152537 0x800 file
7674bcd2cbfdff3c5455e287c609b3364c311914  file
Comment 11 Konstantin Belousov freebsd_committer freebsd_triage 2024-09-03 22:07:26 UTC
(In reply to Marek Zarychta from comment #10)
ftpd uses sendfile(2) AFAIR, and the symptoms of the file corruption mean
that M_RDONLY mbuf, of type M_EXTPG, modified by the network stack.
Comment 12 Kristof Provost freebsd_committer freebsd_triage 2024-09-04 12:57:28 UTC
I still can't reproduce this, but kib's remark makes a lot of sense. This should fix the problem:

https://reviews.freebsd.org/D46529
Comment 13 ss3bsd 2024-09-04 17:18:59 UTC
(In reply to Kristof Provost from comment #12)

This patch seems to work. Thank you.
Comment 14 Marek Zarychta 2024-09-04 17:48:37 UTC
(In reply to Kristof Provost from comment #12)
With the patch from the review D46529 applied, I can no longer reproduce the original issue reported here by ss3bsd and the issue described by me in comment #10. Thank you for the fix, please push it.
Comment 15 Marek Zarychta 2024-09-04 18:47:07 UTC
(In reply to Kristof Provost from comment #12)
>I still can't reproduce this

A few questions arise: Could the issue described in comment #10 be reproducible when sendfile(2) uses UFS as a backend? 
Is sendfile(2) on ZFS reliable and safe? AFAIR there were issues with sendfile(8) and ZFS in the past. 

FWIW: the file I have experimented with seems to be untouched, at least the checksums of the file and those taken from various historical ZFS snapshots are the same.
Comment 16 commit-hook freebsd_committer freebsd_triage 2024-09-05 09:00:34 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=5644e2c6d47c6113a61ab7fc0776b7227677656a

commit 5644e2c6d47c6113a61ab7fc0776b7227677656a
Author:     Kristof Provost <kp@FreeBSD.org>
AuthorDate: 2024-09-04 12:54:23 +0000
Commit:     Kristof Provost <kp@FreeBSD.org>
CommitDate: 2024-09-05 08:59:31 +0000

    if_ovpn: ensure it's safe to modify the mbuf

    PR:             280036
    Reviewed by:    ae
    MFC after:      1 week
    Sponsored by:   Rubicon Communications, LLC ("Netgate")
    Differential Revision:  https://reviews.freebsd.org/D46529

 sys/net/if_ovpn.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
Comment 17 commit-hook freebsd_committer freebsd_triage 2024-09-12 11:53:33 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=8f251937850142748cdf67a46630342934ff9f91

commit 8f251937850142748cdf67a46630342934ff9f91
Author:     Kristof Provost <kp@FreeBSD.org>
AuthorDate: 2024-09-04 12:54:23 +0000
Commit:     Kristof Provost <kp@FreeBSD.org>
CommitDate: 2024-09-12 07:57:04 +0000

    if_ovpn: ensure it's safe to modify the mbuf

    PR:             280036
    Reviewed by:    ae
    MFC after:      1 week
    Sponsored by:   Rubicon Communications, LLC ("Netgate")
    Differential Revision:  https://reviews.freebsd.org/D46529

    (cherry picked from commit 5644e2c6d47c6113a61ab7fc0776b7227677656a)

 sys/net/if_ovpn.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)