Summary: | Virtio net consistently truncates last byte of a fetch xfer with > 8956 bytes of payload | | |
---|---|---|---|
Product: | Base System | Reporter: | Guest <bugmenot> |
Component: | kern | Assignee: | freebsd-virtualization (Nobody) <virtualization> |
Status: | Closed Not A Bug | | |
Severity: | Affects Some People | CC: | adam.chappell, allanjude, freebsd-bugs, jrtc27, net, olevole, rgrimes |
Priority: | --- | | |
Version: | 12.0-RELEASE | | |
Hardware: | amd64 | | |
OS: | Any | | |
See Also: | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=215737 | | |
Description
Guest
2019-04-21 16:06:25 UTC
qemu was installed via brew.

Additional information: OpenBSD using virtio has almost exactly the same problem--one byte truncation when trying to download packages (down to the tcpdump output showing a complete final payload packet but ktrace showing the ftp utility not receiving the final byte). Bizarrely, OpenBSD downloaded and installed packages via the network using virtio, so it's unclear why this seems to work intermittently. In any case, with OpenBSD having almost identical behavior, I am unconvinced this is a FreeBSD issue.

Are jumbo frames in use some place along the path?

I can reproduce this on macOS 10.13.6 (17G7024) High Sierra with qemu 4.0.0 and a FreeBSD 12.0-RELEASE (p1-p6) guest. My packer freebsd builder failed because of this issue. I have tested this for a while with the same template. In most cases, the builder fails (truncated base.txz or truncated pkgng packages). Occasionally, the download and installation are successful. I booted one of these successfully created images with qemu and ran additional tests.

Test #1: fetch http://www.google.de
The last byte is missing.

Test #2: ping google.de
PING google.de (172.217.23.163): 56 data bytes
64 bytes from 172.217.23.163: icmp_seq=0 ttl=255 time=622018725671.832 ms
wrong data byte #8 should be 0x8 but was 0xc0
[...]

Test #3: pkg install curl wget
One successful attempt after many truncated downloads.

Test #4: curl http://www.google.de
No issues

Test #5: wget http://www.google.de
No issues

(In reply to Rodney W. Grimes from comment #3)
I haven't changed the MTU on any interface. The host's external interface is 1500, the host's gateway uses 1500, and the guest's vtnet0 is 1500.

I think this is not a FreeBSD issue. Can't reproduce immediately after a host reboot. The issue occurs after the first sleep/wake cycle of the host and persists until reboot. This seems to be a macOS and/or qemu issue.

(In reply to Christoph Kliemann from comment #6)
Please disregard. Managed to reproduce after a reboot. Sorry for the noise.
Fixed in FreeBSD 12.1

Would be intrigued to understand what the FreeBSD fix was here. Doesn't seem to be in the release notes.

I believe this is not an issue with the FreeBSD guest but more likely an issue with the macOS/Darwin poll() returning POLLPRI events to QEMU's userland TCP stack, Slirp. When Slirp sees POLLPRI on a TCP stream it assumes (not unreasonably) that the incoming data has some urgent data in it. It makes some effort to craft a TCP segment for the guest with the URG flag and pointer set to a best guess. Unfortunately the guest VM's read() won't return urgent/OOB data in normal operation. As a result, data is omitted.

From my tests it seems very prevalent that macOS poll() returns POLLPRI on the last segment (perhaps it's signalling POLLPRI to tell the reader that the stream has finished?), which does explain why we lose the last byte or so.

Lopping out the (revents & SLIRP_POLL_PRI) clause in slirp.c:slirp_pollfds_poll() in favour of the subsequent else-if makes things work, at the cost of NOPing out Slirp's likely hapless attempts to do URG reconstruction. RFC 6093 seems to push us away from ever using TCP urgent in new apps, so maybe that's not as bad as it seems.

(In reply to Adam Chappell from comment #9)
There wasn't one, it's still broken. We've independently been trying to work out what on earth was going on causing us to see the same thing (without realising it was only affecting macOS hosts) until we stumbled upon this report. Not a bug in FreeBSD (well, unless MSG_OOB should be enforced, but then every OS has the same bug), just POLLPRI being extremely ill-defined, SLiRP trying to be helpful and TCP urgent being ubiquitously misunderstood all interacting together to result in this unfortunate outcome. Should no longer occur once QEMU pulls in https://gitlab.freedesktop.org/slirp/libslirp/-/commit/7271345efe182199acaeae602cb78a94a7c6dc9d; thanks for figuring that one out so we didn't have to.

It appears this is a libslirp issue on macOS.
Rebuilding QEMU with slirp updated to include https://gitlab.freedesktop.org/slirp/libslirp/-/commit/7271345efe182199acaeae602cb78a94a7c6dc9d fixes this issue for me. See also https://gitlab.freedesktop.org/slirp/libslirp/-/issues/35

I've filed https://github.com/Homebrew/homebrew-core/issues/73517
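
For anyone who wants to see the guest-side half of this in isolation: the analysis above says a plain read() does not return a byte that arrived flagged as TCP urgent. The following is a minimal sketch of that effect over loopback, not a reproduction of the Slirp code path. It assumes a BSD-style socket stack where sending with MSG_OOB marks the last byte of the buffer as urgent; the function name `transfer_with_urgent_tail` and its `oob_inline` parameter are made up for this illustration.

```python
import socket


def transfer_with_urgent_tail(payload: bytes, oob_inline: bool = False) -> bytes:
    """Send payload over loopback with its last byte marked urgent and
    return what a plain recv() loop on the other end collects until EOF.

    Hypothetical helper for illustration only; it does not involve QEMU
    or libslirp.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()
    listener.close()
    try:
        if oob_inline:
            # Ask the stack to leave the urgent byte in the normal stream.
            receiver.setsockopt(socket.SOL_SOCKET, socket.SO_OOBINLINE, 1)
        # MSG_OOB sets the URG flag; the last byte sent is the urgent byte.
        sender.send(payload, socket.MSG_OOB)
        sender.close()  # FIN, so the reader below sees EOF
        chunks = []
        while True:
            chunk = receiver.recv(4096)  # ordinary read, no MSG_OOB
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(transfer_with_urgent_tail(b"hello"))
    print(transfer_with_urgent_tail(b"hello", oob_inline=True))
```

Without SO_OOBINLINE the reader should come up one byte short, which matches the symptom in this report if Slirp flags the tail of the stream as urgent; with SO_OOBINLINE set the full payload should come through. Exact urgent-data semantics differ between stacks, so treat this as a demonstration of the mechanism rather than a test for the bug.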