Steps to reproduce: 'ifconfig create vmnet0 mtu 9000' 'ifconfig brigde0 addm vmnet0' Set guest mtu to 9000 (which vtnet(4) claims to support). Now we can transceive frames up to 4084 bytes, which this flow from the guest's vtnet(4) interface demonstrates: 16:54:36.672709 00:a0:98:73:9f:42 > 96:07:e9:78:c6:ac, ethertype IPv4 (0x0800), length 4084: 172.21.35.1 > 172.21.35.32: ICMP echo request, id 56840, seq 0, length 4050 16:54:36.672791 96:07:e9:78:c6:ac > 00:a0:98:73:9f:42, ethertype IPv4 (0x0800), length 4084: 172.21.35.32 > 172.21.35.1: ICMP echo reply, id 56840, seq 0, length 4050 On the host this looks similar. Now with a payload size of 4043 instead of 4042 bytes, the reply never makes it through virtio-net: Host flow: 16:57:06.641382 00:a0:98:73:9f:42 > 96:07:e9:78:c6:ac, ethertype IPv4 (0x0800), length 4085: 172.21.35.1 > 172.21.35.32: ICMP echo request, id 27401, seq 0, length 4051 16:57:06.641399 96:07:e9:78:c6:ac > 00:a0:98:73:9f:42, ethertype IPv4 (0x0800), length 4085: 172.21.35.32 > 172.21.35.1: ICMP echo reply, id 27401, seq 0, length 4051 Guest flow: 16:57:06.642073 00:a0:98:73:9f:42 > 96:07:e9:78:c6:ac, ethertype IPv4 (0x0800), length 4085: 172.21.35.1 > 172.21.35.32: ICMP echo request, id 27401, seq 0, length 4051 16:57:06.642233 96:07:e9:78:c6:ac > 00:a0:98:73:9f:42, ethertype IPv4 (0x0800), length 4084: truncated-ip - 1 bytes missing! 172.21.35.32 > 172.21.35.1: ICMP echo reply, id 27401, seq 0, length 405 When using exactly the same setup, just replacing virtio-net with e1000 ('-s 5,virtio-net,vmnet0' with '-s 5,e1000,vmnet0'), jumbo frames do work as expected. Andrey V. Elsukov idea: > This looks like the problem with mbufs bigger than PAGE_SIZE. > Do you see some denied requests in the `netstat -m` output? Nope, there are no denied mbuf requests after sending icmp echo-request through virtio-net with all participants' MTU set to 9000: 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters) 0/0/0 requests for jumbo clusters denied (4k/9k/16k) 0 requests for sfbufs denied
I have the same problem. Cannot use large MTU. ethertype IPv4 (0x0800), length 4084: truncated-ip - 8 bytes missing! 10.0.0.20 > 10.0.1.15: ICMP echo request, id 60022, seq 22, length 4058 ethertype IPv4 (0x0800), length 4084: truncated-ip - 8 bytes missing! 10.0.0.20 > 10.0.1.15: ICMP echo request, id 60022, seq 23, length 4058 ethertype IPv4 (0x0800), length 4084: 10.0.0.20 > 10.0.1.15: ICMP echo request, id 15479, seq 0, length 4050 ethertype IPv4 (0x0800), length 4084: 10.0.1.15 > 10.0.0.20: ICMP echo reply, id 15479, seq 0, length 4050
Just a quick note: This is not related to r321679 (https://svnweb.freebsd.org/base?view=revision&revision=321679) From the description I was confident that the problem was in if_vtnet(4) and solved, but the symptoms are still exactly the same after r321679 (tested on 11.1-RELEASE). Accidentally I first checked with vale(4) instead of if_bridge(4) and saw that the symptom is similar, but with different numbering. The largest frame possible with vale(4) (and if_vtnet(4) and bhyve(8)) is 2048 bytes, resulting in 2006 bytes ICMP (echo-request) payload. I'm not sure if the problem is with if_vtnet(4) or bhyve(4). Unfortunately I don't have the debugging skills to find the code paths myself and not the time learning to do so :-( Any help highly appreciated. -harry
(In reply to Harald Schmalzbauer from comment #2) Hmm, reading my own report would have told me that the problem couldn't be in if_vtnet(4), because replacing virtio-net with e1000 at the bhyve(8) part solves the problem... Just to revise the nonsense-part of my last note. And to add a note: Using e1000 (instead of virtio-net) doesn't work with vale(4) at all! -harry
(In reply to Harald Schmalzbauer from comment #3) Yes, it is a bug in bhyve's virtio-net code, where the 'merged rx-buffer' feature isn't implemented to spec. It only uses a single guest buffer, which is usually 2K or 4K. The virtio-net code needs some restructuring to request the virtio common code to look for enough buffers to cover the size of the incoming packet, and to be able to return the length used in each of these back to the common code. Also as you mentioned, the e1000 emulation doesn't currently work with netmap. There have been patches supplied to fix this - they just need to be tested/integrated.
(In reply to Peter Grehan from comment #4) Peter, thanks a lot for this clarification. I missed the e1000 diffs. I'm ready to test anything I get to compile :-) Should be no problem for recent netmap diffs, since I'm running netmap from -current on 11.1 (I don't have spare hw for tests with -current unfortunately). Short off topic request/question: Since if_vtnet(4) seems to support TSO/GSO, are there plans to provide these for virtio-net? Haven't used vritio-net anywhere else (KVm, XEN, etc.) but I used VMDQ on ESXi (together with vmx3f instead of if_vmx(4)) and the efficiency is really impressive. Wish we could get at least a little closer :-) -harry
Hi! We are running into this issue in our environment too. We'd like to use jumbo frames w/ NFS in bhyve, using virtio-net. We're on a 10G/40G network and we want to minimize the overhead for networking inside our virtual machines as much as possible. thanks! -- Arjan
Hi. We are encountering the same problem as Arjan van der Velde. We also want to pass NFS traffic through our 10G/40G switches. The difference is our VMs are in VMware ESXi with vmx(8) NICs (ie. VMXNET3). Packets will not traverse our FreeBSD gateway unless they are under 4084 bytes. We can ping jumbo packets to either of the vmx(8) NICs, but jumbo packets will not pass through the gateway. thanks for any attention/pointers. P Kern
(In reply to P Kern from comment #7) Sigh. I should have rtfm: just noticed vmx(8) does _not_ mention supporting jumbo frames. Never mind. P Kern.
(In reply to P Kern from comment #8) *offtopic, vmxnet3 specific only, nothing PR related in this comment*: It's correct that if_vmx(4) does not mention MTU or "jumbo" frames, but I was quiet sure it _does_ support 9k frames – just verified positive (stable/11 on ESXi6.5)! if_vmx(4) has been improved over the time, but it still lacks ALTQ support. And vmx3f(4) is still a bit more efficient. Else, if_vmx(4) is featurewise on par with vmx3f(4). Unfortunately vmx3f(4) isn't supported by VMware any longer. I made a patch which allows vmx3f(4) to be compiled on FreeBSD-11, and it also seems to be stable _without_ ALTQ. ALTQ causes panics!!! Unfortunately my skills/time don't last to fix. Here's the compile-patch in case somebody wants to take over: ftp://ftp.omnilan.de/pub/FreeBSD/OmniLAN/vmware-esxi_kernel-modules/10.1.5/FreeBSD-11_vmxnet3-Tools_10.1.5-source.patch -harry
(In reply to Harald Schmalzbauer from comment #9) Thanks for the code! Yes vmx(8) does support recv/xmit of 9k frames but in a case where the VM has 2 vmx(8) NICs, the 9k frames do not seem to be able to transit in one NIC and the out the other. So 9k frames only seem to "work" when the traffic terminates at the VM (...?). Just tested this scenario on the same VM with 2 Intel em(8) NICs: 9k frames seem to pass through the VM via em0<-->em1 (mtu 9k on both) without trouble. Under the same setup but with vmx0<-->vmx1, the 9k frames cannot seem to flow thru: traffic will transit only after MTUs are set to 4096. I'd love to tweak the vmx3f driver but then the VM could not be used for anything we put into production (small group here. no other BSD kernel divers). thx again
(In reply to P Kern from comment #10) I believe that issue of not being able to forward pacekts through a VM using vmx(4) with MTU >4K is that on the receive side the incoming packets are chunked up into n * 4k pages and these do not pass through the forwarding code correctly. This in effect frag's the jumbo frame as it tries to traverse the router, and I do not think the code is up to that task, nor is that a desirable situation.
(In reply to Rodney W. Grimes from comment #11) [ doh! vmx(4)/em(4) -- not ..(8)! sigh, brain rot. ] yup, I was suspecting vmx(_4_) was doing something like that. With our limited resources, our options for now are ... - live with vmx(4) NICs with 4K MTU or - switch to em(4) NICs with 9k MTU. Unless there's some other benefit to using em(4) NICs in our ESXi VMs, we'll probably stick with using vmx NICs.
(In reply to P Kern from comment #12) In case you end up switching from "vmxnet3"/[vmx(4)|vmx3f(4)] to "e1000"/[em(4)], depending on your workload, you can save lots of overhead if you switch to "e1000e" instead, since it utilizes MSI(-X). To make use of, you need to set 'hw.pci.honor_msi_blacklist=0' in loader.conf. And then, there's a negotiaten mismatch between FreeBSD/ESXi (ESX is selecting MSI while FreeBSD MSI-X – as far as I remember). You can circumvent by simply re-loading the kernel module! "e1000e"/[if_em(4)] works fine in MSI-X mode. Since FreeBSD-11, there's also devctl(8), which could take care of the driver re-initialization, but when I wrote my rc(8) script to automatically re-load kernel modules on ESXi guests, it was not available. Happy to share the rc(8) script on request. -harry
(In reply to Harald Schmalzbauer from comment #13) > Happy to share the rc(8) script on request. yup, "request" please. thx.
(In reply to P Kern from comment #14) Sorry for so much nonsense and off-topic comments; But to correct myself in case anybody else wonders.... You do _not_ need the driver reload hack for the "e1000e" _virtual_ 82574 (Intel Hartwell, if_em(4))!!! [e1000 = 82545, which doesn't support MSI, just to mention] I just read with one eye and confused passthrough interfaces, which is what I prefere to have on ESXi for my FreeBSD guests (most often with 82574 or 82576). Only the passthru hardware needs the MSI-X negotiation driver reload workaround. But you probably need the 'hw.pci.honor_msi_blacklist=0' in loader.conf – don't remember well, so please check yourself if you want to avoid unnecessary config options, even if they don't do any harm. -harry
Hi! I just came across these MFCs r354552, r354864 (bhyve: add support for virtio-net mergeable rx buffers). -- Arjan
(In reply to Arjan van der Velde from comment #16) But, it doesn't work for tap backend. Right?
(In reply to Aleksandr Fedorov from comment #17) Not sure. I was hoping someone here could shed some light on that.
If I'm not mistaken if_vtnet with jumbo frames should work even if the host does not support rx mergeable buffers... if that doesn't work I would be inclined to think that this particular situation is not supported by the driver (IOW 64KB packets are not handled). Can somebody test it again with the stable/11 code? And yes, currently rx mergeable buffers are advertised by the host only when the vale net backend is used (for both if_vtnet and e1000). However, now that I think about that, rx mergeable buffers is not something that actually depends on the net backend, so if we advertise them in any case (e.g., also with the tap backend), things should magically work. I'll test this theory in the next days.
I prepared a patch to enable mergeable rx buffers for virtio-net, even with the tap backend. https://reviews.freebsd.org/D23342 Anyone willing to test this with jumbo frames? To test it, append "mrgrxbuf=on" to your virtio-net command-line, e.g. -s 2:1,virtio-net,tap1,mrgrxbuf=on
Just to clarify the situation, on current HEAD if_vtnet + bhyve + jumbo-frames works as expected, even withouth mergeable rx buffers support. I did not test on stable/11, but if it is still true that this combination does not work, then it follows that it must be an issue in the stable/11 if_vtnet driver?
What did not work for me last time I tried is: Have mtu 9000 on the physical interface, and add bridge to it. The bridge will automagically get an MTU of 9000 Then try to add a plain TAP to it, you'll get "invallid argument". Setting the tap MTU to 9000 fixes that problem. I know this is not a typical bhyve problem, but will fail when using vmrun.sh --WjW root@zfstest:/tmp # ifconfig re0 re0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500 options=8209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC,LINKSTATE> ether 9c:5c:8e:84:d6:21 inet 192.168.10.78 netmask 0xfffffc00 broadcast 192.168.11.255 inet 192.168.11.78 netmask 0xffffff00 broadcast 192.168.11.255 inet6 fe80::9e5c:8eff:fe84:d621%re0 prefixlen 64 scopeid 0x1 inet6 2001:4cb8:3:1::78 prefixlen 64 inet6 2001:4cb8:3:1::11:78 prefixlen 64 media: Ethernet autoselect (1000baseT <full-duplex>) status: active nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL> root@zfstest:/tmp # ifconfig re0 mtu 9000 root@zfstest:/tmp # ifconfig bridge0 create root@zfstest:/tmp # ifconfig bridge0 addm re0 up root@zfstest:/tmp # ifconfig bridge0 bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000 ether 02:fe:a0:7f:12:00 id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15 maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200 root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0 member: re0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP> ifmaxaddr 0 port 1 priority 128 path cost 20000 groups: bridge nd6 options=9<PERFORMNUD,IFDISABLED> root@zfstest:/tmp # ifconfig tap1213 create root@zfstest:/tmp # ifconfig bridge0 addm tap1213 ifconfig: BRDGADD tap1213: Invalid argument root@zfstest:/tmp # ifconfig bridge0 addm t root@zfstest:/tmp # ifconfig tap1213 mtu 9000 root@zfstest:/tmp # ifconfig bridge0 addm tap1213 root@zfstest:/tmp #
https://svnweb.freebsd.org/base?view=revision&revision=358180
Can this be closed now?
(In reply to Allan Jude from comment #24) Yes.