Bug 215737 - [bhyve] virtio-net truncates jumbo frames at 4084 bytes
Summary: [bhyve] virtio-net truncates jumbo frames at 4084 bytes
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: misc
Version: 11.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-virtualization mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-01-03 16:19 UTC by Harald Schmalzbauer
Modified: 2019-05-20 15:27 UTC
CC List: 8 users

See Also:


Attachments

Description Harald Schmalzbauer 2017-01-03 16:19:52 UTC
Steps to reproduce:
'ifconfig vmnet0 create mtu 9000'
'ifconfig bridge0 addm vmnet0'

Set the guest MTU to 9000 (which vtnet(4) claims to support).
Now frames of up to 4084 bytes can be transmitted and received,
as this flow from the guest's vtnet(4) interface demonstrates:
16:54:36.672709 00:a0:98:73:9f:42 > 96:07:e9:78:c6:ac, ethertype IPv4
(0x0800), length 4084: 172.21.35.1 > 172.21.35.32: ICMP echo request, id
56840, seq 0, length 4050
16:54:36.672791 96:07:e9:78:c6:ac > 00:a0:98:73:9f:42, ethertype IPv4
(0x0800), length 4084: 172.21.35.32 > 172.21.35.1: ICMP echo reply, id
56840, seq 0, length 4050
On the host this looks similar.
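
(Byte accounting for the working case: 4042 bytes of ICMP payload + 8 bytes of ICMP header = 4050 bytes, the "length 4050" tcpdump reports, plus 20 bytes of IPv4 header and 14 bytes of Ethernet header = 4084 bytes on the wire. One more payload byte yields the 4085-byte frames below.)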

Now, with a payload size of 4043 instead of 4042 bytes, the reply no longer
makes it through virtio-net intact:
Host flow:
16:57:06.641382 00:a0:98:73:9f:42 > 96:07:e9:78:c6:ac, ethertype IPv4
(0x0800), length 4085: 172.21.35.1 > 172.21.35.32: ICMP echo request, id
27401, seq 0, length 4051
16:57:06.641399 96:07:e9:78:c6:ac > 00:a0:98:73:9f:42, ethertype IPv4
(0x0800), length 4085: 172.21.35.32 > 172.21.35.1: ICMP echo reply, id
27401, seq 0, length 4051
Guest flow:
16:57:06.642073 00:a0:98:73:9f:42 > 96:07:e9:78:c6:ac, ethertype IPv4
(0x0800), length 4085: 172.21.35.1 > 172.21.35.32: ICMP echo request, id
27401, seq 0, length 4051
16:57:06.642233 96:07:e9:78:c6:ac > 00:a0:98:73:9f:42, ethertype IPv4
(0x0800), length 4084: truncated-ip - 1 bytes missing! 172.21.35.32 >
172.21.35.1: ICMP echo reply, id 27401, seq 0, length 4051

When using exactly the same setup, just replacing virtio-net with e1000 ('-s 5,virtio-net,vmnet0' with '-s 5,e1000,vmnet0'), jumbo frames do work as expected.

Andrey V. Elsukov's idea:
> This looks like the problem with mbufs bigger than PAGE_SIZE.
> Do you see some denied requests in the `netstat -m` output?

Nope, there are no denied mbuf requests after sending icmp echo-request
through virtio-net with all participants' MTU set to 9000:
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
Comment 1 IPTRACE 2017-07-10 22:35:06 UTC
I have the same problem. Cannot use large MTU.

ethertype IPv4 (0x0800), length 4084: truncated-ip - 8 bytes missing! 10.0.0.20 > 10.0.1.15: ICMP echo request, id 60022, seq 22, length 4058
ethertype IPv4 (0x0800), length 4084: truncated-ip - 8 bytes missing! 10.0.0.20 > 10.0.1.15: ICMP echo request, id 60022, seq 23, length 4058
ethertype IPv4 (0x0800), length 4084: 10.0.0.20 > 10.0.1.15: ICMP echo request, id 15479, seq 0, length 4050
ethertype IPv4 (0x0800), length 4084: 10.0.1.15 > 10.0.0.20: ICMP echo reply, id 15479, seq 0, length 4050
Comment 2 Harald Schmalzbauer 2017-07-31 13:58:33 UTC
Just a quick note:
This is not related to r321679 (https://svnweb.freebsd.org/base?view=revision&revision=321679)

From that commit's description I was confident that the problem was in if_vtnet(4) and had been solved, but the symptoms are still exactly the same after r321679 (tested on 11.1-RELEASE).

By chance I first checked with vale(4) instead of if_bridge(4) and saw that the symptom is similar, just with different numbers.
The largest frame possible with vale(4) (plus if_vtnet(4) and bhyve(8)) is 2048 bytes, resulting in a maximum ICMP (echo request) payload of 2006 bytes.

I'm not sure if the problem is with if_vtnet(4) or bhyve(8).
Unfortunately I have neither the debugging skills to find the code paths myself nor the time to learn them :-(

Any help highly appreciated.

-harry
Comment 3 Harald Schmalzbauer 2017-07-31 14:34:15 UTC
(In reply to Harald Schmalzbauer from comment #2)

Hmm, re-reading my own report would have told me that the problem can't be in if_vtnet(4), because replacing virtio-net with e1000 on the bhyve(8) side solves the problem...
Just to retract the nonsense part of my last note.

And to add a note: Using e1000 (instead of virtio-net) doesn't work with vale(4) at all!

-harry
Comment 4 Peter Grehan freebsd_committer 2017-07-31 15:44:11 UTC
(In reply to Harald Schmalzbauer from comment #3)

Yes, it is a bug in bhyve's virtio-net code, where the 'merged rx-buffer' feature isn't implemented to spec. It only uses a single guest buffer, which is usually 2K or 4K. The virtio-net code needs some restructuring to request the virtio common code to look for enough buffers to cover the size of the incoming packet, and to be able to return the length used in each of these back to the common code.
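
For illustration (this is not bhyve's actual code, and the two queue accessors below are hypothetical placeholders), a minimal C sketch of spec-conformant merged RX buffers (VIRTIO_NET_F_MRG_RXBUF): a frame larger than one posted buffer spills into additional buffers, and num_buffers in the first buffer's 12-byte header tells the guest how many were consumed. A single 4 KiB buffer minus that 12-byte header would also line up with the 4084-byte ceiling observed above (4096 - 12 = 4084).

/*
 * Illustrative sketch only, NOT bhyve's code; guest_rx_buf_get() and
 * guest_rx_buf_put() are hypothetical placeholders for "take the next
 * guest-posted RX buffer" and "record how many bytes were written into it".
 * Assume the used buffers are published to the guest (and the interrupt is
 * raised) only after this function returns, so patching num_buffers at the
 * end is safe.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct virtio_net_hdr_mrg {     /* layout per the virtio spec, 12 bytes */
        uint8_t  flags;
        uint8_t  gso_type;
        uint16_t hdr_len;
        uint16_t gso_size;
        uint16_t csum_start;
        uint16_t csum_offset;
        uint16_t num_buffers;   /* RX buffers consumed by this frame */
};

uint8_t *guest_rx_buf_get(size_t *buflen);            /* NULL if none posted */
void     guest_rx_buf_put(uint8_t *buf, size_t used);

static int
rx_merged(const uint8_t *frame, size_t framelen)
{
        struct virtio_net_hdr_mrg *hdr;
        uint8_t *buf;
        size_t buflen, chunk, off;
        uint16_t nbufs;

        /* The first buffer carries the header followed by frame data. */
        buf = guest_rx_buf_get(&buflen);
        if (buf == NULL || buflen <= sizeof(*hdr))
                return (-1);
        hdr = (struct virtio_net_hdr_mrg *)buf;
        memset(hdr, 0, sizeof(*hdr));
        chunk = framelen < buflen - sizeof(*hdr) ?
            framelen : buflen - sizeof(*hdr);
        memcpy(buf + sizeof(*hdr), frame, chunk);
        guest_rx_buf_put(buf, sizeof(*hdr) + chunk);
        off = chunk;
        nbufs = 1;

        /* Spill the remainder into further buffers instead of truncating. */
        while (off < framelen) {
                buf = guest_rx_buf_get(&buflen);
                if (buf == NULL)
                        return (-1);    /* out of RX buffers: drop the frame */
                chunk = framelen - off < buflen ? framelen - off : buflen;
                memcpy(buf, frame + off, chunk);
                guest_rx_buf_put(buf, chunk);
                off += chunk;
                nbufs++;
        }

        /* Tell the guest how many buffers this frame used. */
        hdr->num_buffers = nbufs;
        return (0);
}

In bhyve this would presumably map onto the descriptor-chain handling shared between pci_virtio_net.c and the virtio common code, as described in this comment.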

Also as you mentioned, the e1000 emulation doesn't currently work with netmap. There have been patches supplied to fix this - they just need to be tested/integrated.
Comment 5 Harald Schmalzbauer 2017-07-31 16:00:33 UTC
(In reply to Peter Grehan from comment #4)
Peter, thanks a lot for this clarification.

I missed the e1000 diffs. I'm ready to test anything I get to compile :-) Should be no problem for recent netmap diffs, since I'm running netmap from -current on 11.1 (I don't have spare hw for tests with -current unfortunately).

Short off topic request/question:
Since if_vtnet(4) seems to support TSO/GSO, are there plans to provide these for virtio-net? I haven't used virtio-net anywhere else (KVM, Xen, etc.), but I used VMDQ on ESXi (together with vmx3f instead of if_vmx(4)) and the efficiency is really impressive. Wish we could get at least a little closer :-)

-harry
Comment 6 Arjan van der Velde 2017-09-11 02:38:45 UTC
Hi! We are running into this issue in our environment too. We'd like to use jumbo frames w/ NFS in bhyve, using virtio-net. We're on a 10G/40G network and we want to minimize the overhead for networking inside our virtual machines as much as possible.

thanks!

-- Arjan
Comment 7 P Kern 2018-04-02 21:35:57 UTC
Hi. We are encountering the same problem as  Arjan van der Velde.
We also want to pass NFS traffic through our 10G/40G switches.

The difference is our VMs are in VMware ESXi with vmx(8) NICs (ie. VMXNET3).
Packets will not traverse our FreeBSD gateway unless they are under 4084 bytes.
We can ping jumbo packets to either of the vmx(8) NICs, but jumbo packets will
not pass through the gateway.

thanks for any attention/pointers.
P Kern
Comment 8 P Kern 2018-04-03 00:16:40 UTC
(In reply to P Kern from comment #7)
Sigh. I should have rtfm: just noticed vmx(8) does _not_ mention
supporting jumbo frames.  Never mind. P Kern.
Comment 9 Harald Schmalzbauer 2018-04-04 20:17:14 UTC
(In reply to P Kern from comment #8)

*offtopic, vmxnet3 specific only, nothing PR related in this comment*:

It's correct that if_vmx(4) does not mention MTU or "jumbo" frames, but I was quite sure it _does_ support 9k frames – just verified (stable/11 on ESXi 6.5)!

if_vmx(4) has been improved over time, but it still lacks ALTQ support.
And vmx3f(4) is still a bit more efficient.
Otherwise, if_vmx(4) is feature-wise on par with vmx3f(4).

Unfortunately vmx3f(4) isn't supported by VMware any longer.
I made a patch which allows vmx3f(4) to be compiled on FreeBSD 11, and it also seems to be stable _without_ ALTQ.  ALTQ causes panics!!!  Unfortunately I have neither the skills nor the time to fix that.
Here's the compile-patch in case somebody wants to take over:
ftp://ftp.omnilan.de/pub/FreeBSD/OmniLAN/vmware-esxi_kernel-modules/10.1.5/FreeBSD-11_vmxnet3-Tools_10.1.5-source.patch

-harry
Comment 10 P Kern 2018-04-06 16:16:23 UTC
(In reply to Harald Schmalzbauer from comment #9)
Thanks for the code!  Yes, vmx(8) does support recv/xmit of 9k frames,
but in a case where the VM has 2 vmx(8) NICs, the 9k frames do not seem to
be able to transit in one NIC and out the other.  So 9k frames only seem
to "work" when the traffic terminates at the VM (...?).
Just tested this scenario on the same VM with 2 Intel em(8) NICs: 9k frames
seem to pass through the VM via em0<-->em1 (mtu 9k on both) without trouble.
Under the same setup but with vmx0<-->vmx1, the 9k frames cannot seem to flow
thru: traffic will transit only after MTUs are set to 4096.
I'd love to tweak the vmx3f driver but then the VM could not be used for
anything we put into production (small group here. no other BSD kernel divers).
thx again
Comment 11 Rodney W. Grimes freebsd_committer 2018-04-06 16:33:35 UTC
(In reply to P Kern from comment #10)
I believe the issue of not being able to forward packets through a VM using vmx(4) with MTU >4K is that on the receive side the incoming packets are chunked up into n * 4k pages and these do not pass through the forwarding code correctly.

This in effect fragments the jumbo frame as it tries to traverse the router,
and I do not think the code is up to that task, nor is that a desirable
situation.
Comment 12 P Kern 2018-04-06 16:53:07 UTC
(In reply to Rodney W. Grimes from comment #11)
[ doh! vmx(4)/em(4) -- not ..(8)!  sigh, brain rot. ]
yup, I was suspecting vmx(_4_) was doing something like that.
With our limited resources, our options for now are ...
   - live with vmx(4) NICs with 4K MTU
or - switch to em(4) NICs with 9k MTU.
Unless there's some other benefit to using em(4) NICs in
our ESXi VMs, we'll probably stick with using vmx NICs.
Comment 13 Harald Schmalzbauer 2018-04-06 19:56:33 UTC
(In reply to P Kern from comment #12)

In case you end up switching from "vmxnet3"/[vmx(4)|vmx3f(4)] to "e1000"/[em(4)], depending on your workload, you can save lots of overhead if you switch to "e1000e" instead, since it utilizes MSI(-X).
To make use of it, you need to set 'hw.pci.honor_msi_blacklist=0' in loader.conf.

And then, there's a negotiation mismatch between FreeBSD and ESXi (ESXi selects MSI while FreeBSD selects MSI-X – as far as I remember).  You can circumvent this by simply re-loading the kernel module!  "e1000e"/[if_em(4)] works fine in MSI-X mode.

Since FreeBSD-11, there's also devctl(8), which could take care of the driver re-initialization, but when I wrote my rc(8) script to automatically re-load kernel modules on ESXi guests, it was not available.
Happy to share the rc(8) script on request.

-harry
Comment 14 P Kern 2018-04-06 20:01:56 UTC
(In reply to Harald Schmalzbauer from comment #13)
> Happy to share the rc(8) script on request.
yup, "request" please.  thx.
Comment 15 Harald Schmalzbauer 2018-04-07 10:08:00 UTC
(In reply to P Kern from comment #14)

Sorry for so many nonsense and off-topic comments; but to correct myself in case anybody else wonders....
You do _not_ need the driver-reload hack for the "e1000e" _virtual_ 82574 (Intel Hartwell, if_em(4))!!! [e1000 = 82545, which doesn't support MSI, just to mention]
I only skimmed and confused this with passthrough interfaces, which are what I prefer to have on ESXi for my FreeBSD guests (most often with 82574 or 82576).  Only the passthru hardware needs the MSI-X negotiation driver-reload workaround.

But you probably still need 'hw.pci.honor_msi_blacklist=0' in loader.conf – I don't remember exactly, so please check for yourself if you want to avoid unnecessary config options, even though they don't do any harm.

-harry