Bug 215737 - [bhyve] utilizing virtio-net truncates jumbo frames at 4084 bytes length
Summary: [bhyve] utilizing virtio-net truncates jumbo frames at 4084 bytes length
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: misc
Version: 11.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-virtualization (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-01-03 16:19 UTC by Harald Schmalzbauer
Modified: 2020-07-17 16:31 UTC
12 users

See Also:


Description Harald Schmalzbauer 2017-01-03 16:19:52 UTC
Steps to reproduce:
'ifconfig vmnet0 create mtu 9000'
'ifconfig bridge0 addm vmnet0'

Set guest mtu to 9000 (which vtnet(4) claims to support).
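
For completeness, the remaining pieces of the reproduction look roughly like this (a sketch only – the bhyve slot number, guest address and payload sizes are illustrative):

# host: attach the guest to the vmnet0 created above
bhyve ... -s 5,virtio-net,vmnet0 ... <vmname>

# guest
ifconfig vtnet0 mtu 9000

# from the other end of the bridge (ping -s sets the ICMP payload size)
ping -s 4042 172.21.35.32    # 4084-byte frame on the wire, works
ping -s 4043 172.21.35.32    # 4085-byte frame, reply never arrives intact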
Now we can send and receive frames of up to 4084 bytes,
as this flow captured on the guest's vtnet(4) interface demonstrates:
16:54:36.672709 00:a0:98:73:9f:42 > 96:07:e9:78:c6:ac, ethertype IPv4
(0x0800), length 4084: 172.21.35.1 > 172.21.35.32: ICMP echo request, id
56840, seq 0, length 4050
16:54:36.672791 96:07:e9:78:c6:ac > 00:a0:98:73:9f:42, ethertype IPv4
(0x0800), length 4084: 172.21.35.32 > 172.21.35.1: ICMP echo reply, id
56840, seq 0, length 4050
On the host this looks similar.

Now with an ICMP payload of 4043 bytes instead of 4042, the reply never
makes it through virtio-net:
Host flow:
16:57:06.641382 00:a0:98:73:9f:42 > 96:07:e9:78:c6:ac, ethertype IPv4
(0x0800), length 4085: 172.21.35.1 > 172.21.35.32: ICMP echo request, id
27401, seq 0, length 4051
16:57:06.641399 96:07:e9:78:c6:ac > 00:a0:98:73:9f:42, ethertype IPv4
(0x0800), length 4085: 172.21.35.32 > 172.21.35.1: ICMP echo reply, id
27401, seq 0, length 4051
Guest flow:
16:57:06.642073 00:a0:98:73:9f:42 > 96:07:e9:78:c6:ac, ethertype IPv4
(0x0800), length 4085: 172.21.35.1 > 172.21.35.32: ICMP echo request, id
27401, seq 0, length 4051
16:57:06.642233 96:07:e9:78:c6:ac > 00:a0:98:73:9f:42, ethertype IPv4
(0x0800), length 4084: truncated-ip - 1 bytes missing! 172.21.35.32 >
172.21.35.1: ICMP echo reply, id 27401, seq 0, length 405

When using exactly the same setup but replacing virtio-net with e1000 (i.e. '-s 5,virtio-net,vmnet0' becomes '-s 5,e1000,vmnet0'), jumbo frames work as expected.

Andrey V. Elsukov's idea:
> This looks like the problem with mbufs bigger than PAGE_SIZE.
> Do you see some denied requests in the `netstat -m` output?

Nope, there are no denied mbuf requests after sending icmp echo-request
through virtio-net with all participants' MTU set to 9000:
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
Comment 1 IPTRACE 2017-07-10 22:35:06 UTC
I have the same problem. Cannot use large MTU.

ethertype IPv4 (0x0800), length 4084: truncated-ip - 8 bytes missing! 10.0.0.20 > 10.0.1.15: ICMP echo request, id 60022, seq 22, length 4058
ethertype IPv4 (0x0800), length 4084: truncated-ip - 8 bytes missing! 10.0.0.20 > 10.0.1.15: ICMP echo request, id 60022, seq 23, length 4058
ethertype IPv4 (0x0800), length 4084: 10.0.0.20 > 10.0.1.15: ICMP echo request, id 15479, seq 0, length 4050
ethertype IPv4 (0x0800), length 4084: 10.0.1.15 > 10.0.0.20: ICMP echo reply, id 15479, seq 0, length 4050
Comment 2 Harald Schmalzbauer 2017-07-31 13:58:33 UTC
Just a quick note:
This is not related to r321679 (https://svnweb.freebsd.org/base?view=revision&revision=321679)

From the commit description I was confident that the problem was in if_vtnet(4) and had been solved, but the symptoms are still exactly the same after r321679 (tested on 11.1-RELEASE).

By accident I first checked with vale(4) instead of if_bridge(4) and saw that the symptom is similar, just with different numbers.
The largest frame possible with vale(4) (together with if_vtnet(4) and bhyve(8)) is 2048 bytes, resulting in a 2006-byte ICMP (echo-request) payload.

I'm not sure if the problem is with if_vtnet(4) or bhyve(8).
Unfortunately I don't have the debugging skills to find the code paths myself, nor the time to learn them :-(

Any help highly appreciated.

-harry
Comment 3 Harald Schmalzbauer 2017-07-31 14:34:15 UTC
(In reply to Harald Schmalzbauer from comment #2)

Hmm, re-reading my own report would have told me that the problem couldn't be in if_vtnet(4), because replacing virtio-net with e1000 on the bhyve(8) side solves the problem...
Just to correct the nonsense part of my last note.

And to add a note: Using e1000 (instead of virtio-net) doesn't work with vale(4) at all!

-harry
Comment 4 Peter Grehan freebsd_committer freebsd_triage 2017-07-31 15:44:11 UTC
(In reply to Harald Schmalzbauer from comment #3)

Yes, it is a bug in bhyve's virtio-net code, where the 'merged rx-buffer' feature isn't implemented to spec. It only uses a single guest buffer, which is usually 2K or 4K. The virtio-net code needs some restructuring to request the virtio common code to look for enough buffers to cover the size of the incoming packet, and to be able to return the length used in each of these back to the common code.

Also as you mentioned, the e1000 emulation doesn't currently work with netmap. There have been patches supplied to fix this - they just need to be tested/integrated.
Comment 5 Harald Schmalzbauer 2017-07-31 16:00:33 UTC
(In reply to Peter Grehan from comment #4)
Peter, thanks a lot for this clarification.

I missed the e1000 diffs. I'm ready to test anything I get to compile :-) Testing recent netmap diffs should be no problem, since I'm running netmap from -current on 11.1 (unfortunately I don't have spare hardware for tests with -current).

Short off topic request/question:
Since if_vtnet(4) seems to support TSO/GSO, are there plans to provide these for virtio-net? I haven't used virtio-net anywhere else (KVM, Xen, etc.), but I have used VMDQ on ESXi (together with vmx3f instead of if_vmx(4)) and the efficiency is really impressive. Wish we could get at least a little closer :-)

-harry
Comment 6 Arjan van der Velde 2017-09-11 02:38:45 UTC
Hi! We are running into this issue in our environment too. We'd like to use jumbo frames w/ NFS in bhyve, using virtio-net. We're on a 10G/40G network and we want to minimize the overhead for networking inside our virtual machines as much as possible.

thanks!

-- Arjan
Comment 7 P Kern 2018-04-02 21:35:57 UTC
Hi. We are encountering the same problem as  Arjan van der Velde.
We also want to pass NFS traffic through our 10G/40G switches.

The difference is our VMs are in VMware ESXi with vmx(8) NICs (i.e. VMXNET3).
Packets will not traverse our FreeBSD gateway unless they are under 4084 bytes.
We can ping jumbo packets to either of the vmx(8) NICs, but jumbo packets will
not pass through the gateway.

thanks for any attention/pointers.
P Kern
Comment 8 P Kern 2018-04-03 00:16:40 UTC
(In reply to P Kern from comment #7)
Sigh. I should have rtfm: just noticed vmx(8) does _not_ mention
supporting jumbo frames.  Never mind. P Kern.
Comment 9 Harald Schmalzbauer 2018-04-04 20:17:14 UTC
(In reply to P Kern from comment #8)

*offtopic, vmxnet3 specific only, nothing PR related in this comment*:

It's correct that if_vmx(4) does not mention MTU or "jumbo" frames, but I was quite sure it _does_ support 9k frames – just verified it (stable/11 on ESXi 6.5)!

if_vmx(4) has been improved over time, but it still lacks ALTQ support.
And vmx3f(4) is still a bit more efficient.
Otherwise, if_vmx(4) is feature-wise on par with vmx3f(4).

Unfortunately vmx3f(4) isn't supported by VMware any longer.
I made a patch which allows vmx3f(4) to be compiled on FreeBSD-11, and it also seems to be stable _without_ ALTQ.  ALTQ causes panics!!!  Unfortunately my skills/time don't suffice to fix that.
Here's the compile-patch in case somebody wants to take over:
ftp://ftp.omnilan.de/pub/FreeBSD/OmniLAN/vmware-esxi_kernel-modules/10.1.5/FreeBSD-11_vmxnet3-Tools_10.1.5-source.patch

-harry
Comment 10 P Kern 2018-04-06 16:16:23 UTC
(In reply to Harald Schmalzbauer from comment #9)
Thanks for the code!   Yes, vmx(8) does support recv/xmit of 9k frames,
but in a case where the VM has 2 vmx(8) NICs, the 9k frames do not seem to
be able to transit in one NIC and out the other.  So 9k frames only seem
to "work" when the traffic terminates at the VM (...?).
Just tested this scenario on the same VM with 2 Intel em(8) NICs: 9k frames
seem to pass through the VM via em0<-->em1 (mtu 9k on both) without trouble.
Under the same setup but with vmx0<-->vmx1, the 9k frames cannot seem to flow
thru: traffic will transit only after MTUs are set to 4096.
I'd love to tweak the vmx3f driver but then the VM could not be used for
anything we put into production (small group here. no other BSD kernel divers).
thx again
Comment 11 Rodney W. Grimes freebsd_committer freebsd_triage 2018-04-06 16:33:35 UTC
(In reply to P Kern from comment #10)
I believe the issue of not being able to forward packets through a VM using vmx(4) with MTU >4K is that on the receive side the incoming packets are chunked up into n * 4k pages and these do not pass through the forwarding code correctly.

This in effect fragments the jumbo frame as it tries to traverse the router,
and I do not think the code is up to that task, nor is that a desirable
situation.
Comment 12 P Kern 2018-04-06 16:53:07 UTC
(In reply to Rodney W. Grimes from comment #11)
[ doh! vmx(4)/em(4) -- not ..(8)!  sigh, brain rot. ]
yup, I was suspecting vmx(_4_) was doing something like that.
With our limited resources, our options for now are ...
   - live with vmx(4) NICs with 4K MTU
or - switch to em(4) NICs with 9k MTU.
Unless there's some other benefit to using em(4) NICs in
our ESXi VMs, we'll probably stick with using vmx NICs.
Comment 13 Harald Schmalzbauer 2018-04-06 19:56:33 UTC
(In reply to P Kern from comment #12)

In case you end up switching from "vmxnet3"/[vmx(4)|vmx3f(4)] to "e1000"/[em(4)], depending on your workload, you can save a lot of overhead if you switch to "e1000e" instead, since it utilizes MSI(-X).
To make use of it, you need to set 'hw.pci.honor_msi_blacklist=0' in loader.conf.

And then, there's a negotiation mismatch between FreeBSD/ESXi (ESX selects MSI while FreeBSD selects MSI-X – as far as I remember).  You can circumvent this by simply re-loading the kernel module!  "e1000e"/[if_em(4)] works fine in MSI-X mode.

Since FreeBSD-11, there's also devctl(8), which could take care of the driver re-initialization, but when I wrote my rc(8) script to automatically re-load kernel modules on ESXi guests, it was not available.
Happy to share the rc(8) script on request.
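The core of what the script does is just a driver re-initialization; roughly like this (a minimal sketch only – module names and the PCI selector are examples, and it assumes em(4) is loaded as if_em.ko rather than compiled into the kernel):

# /boot/loader.conf – so MSI/MSI-X is offered to the virtual NIC at all
hw.pci.honor_msi_blacklist="0"

# after boot, re-initialize the driver by reloading the module
kldunload if_em && kldload if_em

# or, since FreeBSD 11, per device via devctl(8) – look up the
# PCI selector with pciconf -lv first
devctl detach pci0:0:6:0 && devctl attach pci0:0:6:0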

-harry
Comment 14 P Kern 2018-04-06 20:01:56 UTC
(In reply to Harald Schmalzbauer from comment #13)
> Happy to share the rc(8) script on request.
yup, "request" please.  thx.
Comment 15 Harald Schmalzbauer 2018-04-07 10:08:00 UTC
(In reply to P Kern from comment #14)

Sorry for all the nonsense and off-topic comments; but to correct myself in case anybody else wonders....
You do _not_ need the driver reload hack for the "e1000e" _virtual_ 82574 (Intel Hartwell, if_em(4))!!! [e1000 = 82545, which doesn't support MSI, just to mention]
I just read with one eye and confused this with passthrough interfaces, which is what I prefer to have on ESXi for my FreeBSD guests (most often with 82574 or 82576).  Only the passthru hardware needs the MSI-X negotiation driver-reload workaround.

But you probably still need 'hw.pci.honor_msi_blacklist=0' in loader.conf – I don't remember well, so please check yourself if you want to avoid unnecessary config options, even if they don't do any harm.

-harry
Comment 16 Arjan van der Velde 2020-01-22 12:51:55 UTC
Hi! I just came across these MFCs r354552, r354864 (bhyve: add support for virtio-net mergeable rx buffers).

-- Arjan
Comment 17 Aleksandr Fedorov freebsd_committer freebsd_triage 2020-01-22 14:27:57 UTC
(In reply to Arjan van der Velde from comment #16)
But it doesn't work for the tap backend, right?
Comment 18 Arjan van der Velde 2020-01-22 16:14:26 UTC
(In reply to Aleksandr Fedorov from comment #17)

Not sure. I was hoping someone here could shed some light on that.
Comment 19 Vincenzo Maffione freebsd_committer freebsd_triage 2020-01-22 23:07:04 UTC
If I'm not mistaken if_vtnet with jumbo frames should work even if the host does not support rx mergeable buffers... if that doesn't work I would be inclined to think that this particular situation is not supported by the driver (IOW 64KB packets are not handled). Can somebody test it again with the stable/11 code?

And yes, currently rx mergeable buffers are advertised by the host only when the vale net backend is used (for both if_vtnet and e1000).

However, now that I think about it, rx mergeable buffers are not something that actually depends on the net backend, so if we advertise them in any case (e.g., also with the tap backend), things should magically work. I'll test this theory in the coming days.
Comment 20 Vincenzo Maffione freebsd_committer freebsd_triage 2020-01-23 22:14:57 UTC
I prepared a patch to enable mergeable rx buffers for virtio-net, even with the tap backend.

https://reviews.freebsd.org/D23342

Anyone willing to test this with jumbo frames?
To test it, append "mrgrxbuf=on" to your virtio-net command-line, e.g.
  -s 2:1,virtio-net,tap1,mrgrxbuf=on
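
A rough recipe to try it end-to-end with jumbo frames (interface names, slot number and payload size are just examples; note the tap MTU must be raised before the interface can be added to a 9000-MTU bridge):

# host (assumes bridge0 already exists with MTU 9000)
ifconfig tap1 create mtu 9000
ifconfig bridge0 addm tap1 up
bhyve ... -s 2:1,virtio-net,tap1,mrgrxbuf=on ... <vmname>

# guest
ifconfig vtnet0 mtu 9000

# from a peer on the bridge, well above the old 4084-byte limit
ping -s 8000 <guest-address>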
Comment 21 Vincenzo Maffione freebsd_committer freebsd_triage 2020-01-26 21:19:52 UTC
Just to clarify the situation: on current HEAD, if_vtnet + bhyve + jumbo frames works as expected, even without mergeable rx buffers support.
I did not test on stable/11, but if it is still true that this combination does not work, then it must be an issue in the stable/11 if_vtnet driver?
Comment 22 Willem Jan Withagen 2020-01-29 11:46:01 UTC
What did not work for me last time I tried is:

Have MTU 9000 on the physical interface, and add a bridge to it.
The bridge will automagically get an MTU of 9000.

Then try to add a plain tap to it and you'll get "Invalid argument".
Setting the tap MTU to 9000 first fixes that problem.

I know this is not a typical bhyve problem, but it will fail when using vmrun.sh.

--WjW


root@zfstest:/tmp # ifconfig re0
re0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC,LINKSTATE>
        ether 9c:5c:8e:84:d6:21
        inet 192.168.10.78 netmask 0xfffffc00 broadcast 192.168.11.255
        inet 192.168.11.78 netmask 0xffffff00 broadcast 192.168.11.255
        inet6 fe80::9e5c:8eff:fe84:d621%re0 prefixlen 64 scopeid 0x1
        inet6 2001:4cb8:3:1::78 prefixlen 64
        inet6 2001:4cb8:3:1::11:78 prefixlen 64
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
root@zfstest:/tmp # ifconfig re0 mtu 9000
root@zfstest:/tmp # ifconfig bridge0 create
root@zfstest:/tmp # ifconfig bridge0 addm re0 up
root@zfstest:/tmp # ifconfig bridge0
bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        ether 02:fe:a0:7f:12:00
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: re0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 1 priority 128 path cost 20000
        groups: bridge
        nd6 options=9<PERFORMNUD,IFDISABLED>
root@zfstest:/tmp # ifconfig tap1213 create
root@zfstest:/tmp # ifconfig bridge0 addm tap1213
ifconfig: BRDGADD tap1213: Invalid argument
root@zfstest:/tmp # ifconfig bridge0 addm t
root@zfstest:/tmp # ifconfig tap1213 mtu 9000
root@zfstest:/tmp # ifconfig bridge0 addm tap1213
root@zfstest:/tmp #
Comment 24 Allan Jude freebsd_committer freebsd_triage 2020-07-16 16:47:23 UTC
Can this be closed now?
Comment 25 IPTRACE 2020-07-16 16:49:27 UTC
(In reply to Allan Jude from comment #24)

Yes.