Bug 237441

Summary: Virtio net consistently truncates last byte of a fetch xfer with > 8956 bytes of payload
Product: Base System
Reporter: Guest <bugmenot>
Component: kern
Assignee: freebsd-virtualization (Nobody) <virtualization>
Status: Closed Not A Bug    
Severity: Affects Some People
CC: adam.chappell, allanjude, freebsd-bugs, jrtc27, net, olevole, rgrimes
Priority: ---    
Version: 12.0-RELEASE   
Hardware: amd64   
OS: Any   
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=215737

Description Guest 2019-04-21 16:06:25 UTC
Reading 215737 carefully, I couldn't decide if this was the same problem but ultimately decided it wasn't.

Environment:  OSX High Sierra running QEMU and the 12.0 release qcow2 image published on the FreeBSD site.

Qemu command line:  qemu-system-x86_64 -m 2048 -hda FreeBSD-12.0-RELEASE-amd64.qcow2  -netdev user,id=mynet0,hostfwd=tcp:127.0.0.1:7722-:22 -device virtio-net-pci,netdev=mynet0

Trying to install pkg fails.  If you do the following command:

fetch http://pkg.freebsd.org/FreeBSD:12:amd64/latest/Latest/pkg.txz

you will consistently get the following (Note:  with or without [TR]XCSUM enabled):

fetch: pkg.txz appears to be truncated: 3395051/3395052 bytes

If you download the full package and use dd to grab all but the last byte, the SHA256 sums match, so the data is not corrupted, just missing the final byte (a 'Z' character).  Furthermore, if you run tcpdump in the guest against the vtnet0 interface while it's transferring, you can see the final 'Z' byte in the final packet, so qemu is getting the data to the guest.  If you then ktrace the fetch process, you'll see that its final read *doesn't* have the 'Z', which rules out a bug in fetch/libfetch.
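
The dd-and-SHA256 comparison described above can be sketched as follows. This is a minimal illustration, not the exact commands used for this report; the function names are made up for the example and the inputs are assumed to be the full and truncated downloads read into memory.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def missing_only_last_byte(full: bytes, truncated: bytes) -> bool:
    """True if truncated equals full with exactly its final byte removed.

    Mirrors the manual check: drop the last byte of the good copy
    (what dd was used for) and compare checksums with the bad copy.
    """
    if len(truncated) != len(full) - 1:
        return False
    return sha256_hex(full[:-1]) == sha256_hex(truncated)


if __name__ == "__main__":
    # Hypothetical stand-in data; a real check would read both files from disk.
    good = b"payload bytes" + b"Z"
    bad = good[:-1]
    print(missing_only_last_byte(good, bad))
```

A True result means the short download is intact apart from the missing trailing byte, which is the situation reported here.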

Using fetch to test for sizing, I started downloading packages at the jumbo frame boundary and found that packages <= 8956 bytes work and >= 8960 exhibit the failure.
Comment 1 Guest 2019-04-21 16:06:55 UTC
qemu was installed via brew.
Comment 2 Guest 2019-04-21 17:34:53 UTC
Additional information:  OpenBSD using virtio has almost exactly the same problem--one byte truncation when trying to download packages (down to the tcpdump output showing a complete final payload packet but ktrace showing the ftp utility not receiving the final byte).  Bizarrely, OpenBSD downloaded and installed packages via the network using virtio so it's unclear why this seems to work intermittently.

In any case, with OpenBSD having almost identical behavior, I am unconvinced this is a FreeBSD issue.
Comment 3 Rodney W. Grimes freebsd_committer freebsd_triage 2019-05-25 09:38:05 UTC
Are jumbo frames in use some place along the path?
Comment 4 Christoph Kliemann 2019-06-27 23:19:25 UTC
I can reproduce this on macOS 10.13.6 (17G7024) High Sierra with qemu 4.0.0 and a FreeBSD 12.0-RELEASE (p1-p6) guest.

My packer freebsd builder failed because of this issue.
I have tested this for a while with the same template.

In most cases, the builder fails (truncated base.txz or truncated pkgng packages).
Occasionally, the download and installation are successful.

I booted one of these successfully created images with qemu and ran additional tests.

Test #1: fetch http://www.google.de
The last byte is missing.

Test #2: ping google.de
PING google.de (172.217.23.163): 56 data bytes
64 bytes from 172.217.23.163: icmp_seq=0 ttl=255 time=622018725671.832 ms
wrong data byte #8 should be 0x8 but was 0xc0
[...]

Test #3: pkg install curl wget
One successful attempt after many truncated downloads.

Test #4: curl http://www.google.de
No issues

Test #5: wget http://www.google.de
No issues
Comment 5 Christoph Kliemann 2019-06-27 23:36:28 UTC
(In reply to Rodney W. Grimes from comment #3)
I haven't changed mtu on any interface.
The host's external interface MTU is 1500, the host's gateway uses 1500, and the guest's vtnet0 is 1500.
Comment 6 Christoph Kliemann 2019-07-13 13:50:36 UTC
I think this is not a FreeBSD issue.

Can't reproduce immediately after a host reboot.
The issue occurs after the first sleep/wake cycle of the host and persists until reboot.
This seems to be a macOS and/or qemu issue.
Comment 7 Christoph Kliemann 2019-07-13 15:05:57 UTC
(In reply to Christoph Kliemann from comment #6)

Please disregard. Managed to reproduce after a reboot. Sorry for the noise.
Comment 8 Allan Jude freebsd_committer freebsd_triage 2020-07-17 16:32:39 UTC
Fixed in FreeBSD 12.1
Comment 9 Adam Chappell 2021-02-10 18:08:57 UTC
Would be intrigued to understand what the FreeBSD fix was here. Doesn't seem to be in release notes. I believe this issue is not an issue with the FreeBSD guest but more likely an issue with the MacOS/Darwin poll() returning POLLPRI events to Qemu's userland TCP stack, Slirp. When Slirp sees POLLPRI on a TCP stream it assumes (not unreasonably) that the incoming data has some urgent data in it. It makes some effort to craft a TCP segment for the guest with URG flag and pointer set to a best guess.

Unfortunately the guest VM's read() won't return urgent/OOB data in normal operation. As a result, data is omitted.
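
The behaviour described above, where an ordinary read does not return a byte that was marked urgent, can be reproduced without QEMU. The following is a minimal sketch using two loopback TCP sockets; the function name and the sample payload are made up for the example, and it assumes a Unix-like host whose TCP stack treats the last byte of a MSG_OOB send as the urgent byte (as Linux and the BSDs do by default).

```python
import socket


def receive_with_urgent_tail(payload: bytes, oob_inline: bool) -> bytes:
    """Send payload over loopback with its last byte marked urgent and
    return everything that plain recv() calls deliver to the receiver."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()
    try:
        if oob_inline:
            # Ask the stack to leave urgent data in the normal byte stream.
            receiver.setsockopt(socket.SOL_SOCKET, socket.SO_OOBINLINE, 1)
        receiver.settimeout(5.0)
        # MSG_OOB sets the URG flag; the final byte becomes the urgent byte.
        sender.send(payload, socket.MSG_OOB)
        sender.shutdown(socket.SHUT_WR)  # signal end of stream
        received = b""
        while True:
            chunk = receiver.recv(4096)  # plain read, no MSG_OOB
            if not chunk:
                break
            received += chunk
        return received
    finally:
        sender.close()
        receiver.close()
        listener.close()


if __name__ == "__main__":
    print(receive_with_urgent_tail(b"hello worldZ", oob_inline=False))
    print(receive_with_urgent_tail(b"hello worldZ", oob_inline=True))
```

Without SO_OOBINLINE the receiver is expected to see the payload minus its trailing byte, which matches the one-byte truncation seen by fetch; with SO_OOBINLINE the byte stays in the stream.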

From my tests it seems very prevalent that MacOS poll() returns POLLPRI on the last segment (perhaps it's signalling POLLPRI to tell the reader that the stream has finished?), which does explain why we lose the last byte or so.

Lopping out the (revents & SLIRP_POLL_PRI) clause in slirp.c:slirp_pollfds_poll() in favour of the subsequent else-if makes things work, at the cost of NOPing out Slirp's likely hapless attempts to do URG reconstruction.
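
The effect of that change can be illustrated with a toy model. This is not libslirp code: the flag values, the function name, and the assumption that the urgent mark lands on exactly the final byte are hypothetical simplifications chosen only to show why dropping the POLLPRI branch restores the missing byte.

```python
# Hypothetical flag values for this model only; not taken from libslirp.
MODEL_POLL_IN = 0x1
MODEL_POLL_PRI = 0x2


def bytes_seen_by_guest_read(payload: bytes, revents: int,
                             honour_pollpri: bool) -> bytes:
    """Model what a plain read() in the guest would return.

    honour_pollpri=True models the original branch order: POLLPRI is
    checked first and the data is forwarded as urgent.
    honour_pollpri=False models the workaround: POLLPRI falls through
    to the ordinary read path.
    """
    if honour_pollpri and revents & MODEL_POLL_PRI:
        # Assumption: the forwarded segment marks the final byte as urgent,
        # and a plain read() in the guest does not return urgent data.
        return payload[:-1]
    if revents & (MODEL_POLL_IN | MODEL_POLL_PRI):
        # Ordinary path: everything is delivered in-band.
        return payload
    return b""


if __name__ == "__main__":
    last_segment_events = MODEL_POLL_IN | MODEL_POLL_PRI
    print(bytes_seen_by_guest_read(b"dataZ", last_segment_events, True))
    print(bytes_seen_by_guest_read(b"dataZ", last_segment_events, False))
```

In this model, a host that reports POLLPRI on the last segment loses the trailing byte only while the POLLPRI branch is honoured.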

RFC6093 seems to push us away from ever using TCP urgent in new apps, so maybe that's not as bad as it seems.
Comment 10 Jessica Clarke freebsd_committer freebsd_triage 2021-03-19 13:26:56 UTC
(In reply to Adam Chappell from comment #9)

There wasn't one; it's still broken. We've independently been trying to work out what on earth was causing us to see the same thing (without realising it only affected macOS hosts) until we stumbled upon this report.

Not a bug in FreeBSD (well, unless MSG_OOB should be enforced, but then every OS has the same bug), just POLLPRI being extremely ill-defined, SLiRP trying to be helpful and TCP urgent being ubiquitously misunderstood all interacting together to result in this unfortunate outcome. Should no longer occur once QEMU pulls in https://gitlab.freedesktop.org/slirp/libslirp/-/commit/7271345efe182199acaeae602cb78a94a7c6dc9d; thanks for figuring that one out so we didn't have to.
Comment 11 Alex Richardson freebsd_committer freebsd_triage 2021-03-19 13:58:51 UTC
It appears this is a libslirp issue on macOS.

Rebuilding QEMU with slirp updated to include https://gitlab.freedesktop.org/slirp/libslirp/-/commit/7271345efe182199acaeae602cb78a94a7c6dc9d fixes this issue for me.

See also https://gitlab.freedesktop.org/slirp/libslirp/-/issues/35

I've filed https://github.com/Homebrew/homebrew-core/issues/73517