Any PCI* device/driver that relies on MSI or MSI-X interrupts doesn't work in a VMware (ESXi hypervisor) virtual machine.
This is because of a quirk workaround in the FreeBSD kernel that fails requests for MSI or MSI-X interrupt vectors when it detects that the PCIe Root Port or PCI bridge device is a VMware device (PCI Vendor ID 15adh, Device ID 0790h or 07a0h).
The comment in the code suggests MSI/MSI-X didn't work in older versions of ESXi:
* MSI-X allocation doesn't work properly for devices passed through
* by VMware up to at least ESXi 5.1.
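The device-ID match described above can be sketched roughly as follows. This is only an illustration, not the kernel's actual code: the struct, enum, and function names are hypothetical, the two combined IDs are the ones quoted in this report, and `honor_blacklist` stands in for the hw.pci.honor_msi_blacklist tunable mentioned later in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical quirk kind; the real kernel defines its own constants. */
enum quirk_type { QUIRK_DISABLE_MSIX };

struct quirk {
	uint32_t	devid;	/* (device ID << 16) | vendor ID */
	enum quirk_type	type;
};

/* The two VMware bridge IDs named in this report. */
static const struct quirk quirks[] = {
	{ 0x079015ad, QUIRK_DISABLE_MSIX },
	{ 0x07a015ad, QUIRK_DISABLE_MSIX },
};

static uint32_t
make_devid(uint16_t vendor, uint16_t device)
{
	return (((uint32_t)device << 16) | vendor);
}

static bool
has_quirk(uint32_t devid, enum quirk_type type)
{
	for (size_t i = 0; i < sizeof(quirks) / sizeof(quirks[0]); i++)
		if (quirks[i].devid == devid && quirks[i].type == type)
			return (true);
	return (false);
}

/*
 * Returns false when the upstream bridge is on the blacklist and the
 * blacklist is being honored, i.e. the vector request would be failed.
 */
bool
msix_allowed(uint16_t bridge_vendor, uint16_t bridge_device,
    bool honor_blacklist)
{
	if (!honor_blacklist)
		return (true);
	return (!has_quirk(make_devid(bridge_vendor, bridge_device),
	    QUIRK_DISABLE_MSIX));
}
```

With this sketch, msix_allowed(0x15ad, 0x0790, true) is false, while the same call with honor_blacklist set to false is true.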
I don't think this is true any more. Verified this by booting with:
Everything seems to be working fine with various devices and drivers (passthrough ixgbe/ixgbevf, vmxnet3, etc.) with this boot workaround in a FreeBSD 10.2 VM on ESXi 6.0.
I'll propose a trivial patch to remove this quirk workaround.
Created attachment 162212
patch to remove VMware MSI/MSI-X quirk
Removing the ESXi MSI-X quirk altogether is simply way too gross, as the intention is to provide a setup which at least works out of the box, also on older versions of the VMM, even if that doesn't yield the optimal configuration. As you apparently work for VMware, could you please get the revision ID in the PCI header of the 0x079015ad and 0x07a015ad bridges bumped for the implementations that have MSI-X allocation fixed (if that hasn't already been done), so we can distinguish the good from the bad ones in the PCI code? My logs indicate that at least the bridges reporting revision 0x1 were affected.
(In reply to Marius Strobl from comment #2)
I tried to search for the history behind why this quirk was introduced for VMware virtual machines in the first place, as we've never shipped a version of ESXi where the implementation of MSI or MSI-X was broken, like the comment suggests it was in ESXi 5.1 and before.
So there have been no bumps to the PCI revision IDs of these devices that could identify when it was broken vs. fixed, since from our POV it was never broken.
Can you point me to some bug report or mailing list archive or something that details what was supposedly broken w.r.t. "MSI-X allocation on ESXi?"
(In reply to bhavesh from comment #3)
The problem with broken versions of ESXi is that allocation of MSI-X vectors fails when tried for the first time but succeeds on subsequent attempts. As Jack Vogel from Intel pointed out, this behavior is strongly reminiscent of a design bug in Linux/KVM:
The full thread starts here:
I think at that time igb(4) additionally had an orthogonal bug, fixed in the meantime, that caused it to fail to properly fall back to MSI or even INTx, leading to an overall non-working default setup when run under ESXi back then.
Apart from that, there was a private and independent report to firstname.lastname@example.org and me in June 2013, indicating that using INTx with ESXi 5.1 causes interrupt storms on the FreeBSD side. Together with the VMware documentation of that time listing only three devices known to work with MSI-X in pass-through mode, this led to the conclusion that MSIs are the only thing that reliably works with ESXi. Consequently, the current behavior of blacklisting the VMware PCI bridges for MSI-X (but, contrary to your claim, not for MSI) was implemented and verified to yield a configuration which works out of the box in the known problematic setups involving ESXi.
As for the MSI-X allocation bug, it likely isn't/wasn't present in the bridge code of ESXi itself but rather in its LAPIC emulation. Still, bumping the PCI revision IDs to indicate ESXi versions where that problem has been fixed would be way preferable to having to query the hypervisor version by other means in the machine-independent PCI code of FreeBSD in order to limit the blacklist entries for VMware.
(In reply to Marius Strobl from comment #4)
Thanks for the additional background. I had not found those in my searching.
It seems https://lists.freebsd.org/pipermail/freebsd-stable/2012-October/070220.html is being used as the authoritative reasoning behind blacklisting VMware VMs (via the PCI bridge and PCIe root port VID/DID) for lack of MSI-X support.
Unfortunately that e-mail/thread doesn't give more details than:
When my driver would first attempt to get some vectors the Linux code would go look at the vector table, but the way the PCI code in FreeBSD works that table is not set up yet, so Linux would see no legitimate vectors in the table, and decide the guest was ineligible for MSIX :(, but by this time the FreeBSD code actually DID write some vectors into the table, and thus when you load the driver the SECOND time Linux would see the table populated and TADA!! would enable the guest to use MSIX.
Now, I went thru a bunch of efforts via our Linux team here to have the KVM
code fixed, its design was bogus, and I believe it has been, but it sounds like maybe
VMWare has the same broken design??
Need more details, like why the author of that e-mail believes "its design was bogus" and can he mention the GIT commit or a link to the discussion where it was supposedly "fixed" in KVM?
I'm Cc'ing Jack directly. Hopefully the e-mail address is still valid.
(In reply to bhavesh from comment #5)
Is there still a good reason to keep this blacklisting for ESXi, other than exposing bugs in fall-back code in drivers (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218113)?
It's been nearly a year and a half since the last comment.
I've removed the blacklisting for 11-RELEASE on my box inside ESXi 6.5 and there doesn't seem to be any previously reported signs of "badness" with LSI SAS emulation and igb(4) passthrough so far.
This needs to be regression tested as far back as ESXi 5.0 imho, as that is still a common production environment.
It would be nice to have a fix that just did the right thing on any of ESXi 5.0 to 6.5.
It would also be nice to have a fix that works on FreeBSD 10, 11 and -current, and even nicer if that fix could back port even further back.
(In reply to Rodney W. Grimes from comment #7)
ESXi 5.0 and 5.1 had their general support discontinued on 24th August 2016: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2145103
Do you have any empirical evidence to suggest that 5.0 "is still a common production environment," or is this more of a "Windows XP"-style "let's just use it in the hope it doesn't break" type of thing? Further, as email@example.com pointed out, ESXi was never broken, and there is hardly anything but speculative rumours to support the original inclusion (not even a reproducible bug report!)
One thing that came to my mind regarding some people apparently not being able to reproduce the original problem is that it may also depend on the virtualization capabilities of the hardware used, e. g. APICv, AVIC, MSI/MSI-X remapping etc., i. e. the bug may be present in the full software emulation of ESXi (or some versions thereof) but hidden when interrupt handling is assisted by hardware.
(In reply to Marius Strobl from comment #9)
And without any of the bug reports that led to the exclusion of MSI-X support under FreeBSD we'll never know. Linux has MSI-X under ESXi, if I recall correctly, so chances are that it's a "FreeBSD problem," not an ESXi one.
This bug has been around quite a while. Is there anything that VMware needs to do to resolve this issue?
Hmm, I don't recall which issue we ran into with VMWare. We have had some issues with Xen hypervisors that might be related. They had to do with how the hypervisor cached MSI-X table entries and that it didn't notice updates to the table that FreeBSD's kernel performed because it assumed the table was not modified once MSI-X was enabled in the capability register.
Commits related to those issues are r302181 + r302635 and r310806.
From the thread about igb(4) that was quoted, it sounds like it might be a similar issue where VMWare might be assuming that MSI-X tables are only updated in a particular order. FreeBSD follows a scheme where it marks all the table entries as disabled and then enables MSI-X in the capability register and then writes individual table entries as the driver calls bus_setup_intr(). This means that the table can change while MSI-X is enabled. A table entry might also be written multiple times if a device driver moves an interrupt to a different CPU than the one chosen by the system. In real hardware this works fine as the hardware checks the table to compose the MSI message before sending each message. Hypervisors might choose to cache the table since reading the actual table might be too expensive. However, in that case hypervisors have to trap all writes to the table and update their cached copy for each write. The Xen issues had to do with Xen assuming it could ignore writes made while MSI-X was active since Linux writes the whole table and then enables MSI-X. I would start by seeing if you have a shadow copy of the MSI-X table and if you make assumptions about which writes to that table have to be trapped to update the shadow table.
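To make the ordering issue above concrete, here is a toy model of a hypervisor that keeps a shadow copy of the MSI-X table. It is a simplified sketch under stated assumptions, not how Xen or VMware actually implement it: every type and function name is hypothetical, a table entry is reduced to a single 32-bit message value, and "disabled" is modeled as writing 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define	NVEC	4

struct vmsix {
	uint32_t guest[NVEC];	/* table contents as written by the guest */
	uint32_t shadow[NVEC];	/* hypervisor's cached copy */
	bool	 enabled;	/* MSI-X enable bit in the capability */
	bool	 trap_writes_when_enabled; /* false models the buggy case */
};

/* Guest writes one table entry; the hypervisor may or may not notice. */
void
guest_write_entry(struct vmsix *v, unsigned idx, uint32_t msg)
{
	assert(idx < NVEC);
	v->guest[idx] = msg;
	if (!v->enabled || v->trap_writes_when_enabled)
		v->shadow[idx] = msg;
}

void
guest_enable(struct vmsix *v)
{
	v->enabled = true;
}

/* Order described above: clear all entries, enable, then program. */
void
freebsd_style_setup(struct vmsix *v, const uint32_t *msgs)
{
	for (unsigned i = 0; i < NVEC; i++)
		guest_write_entry(v, i, 0);
	guest_enable(v);
	for (unsigned i = 0; i < NVEC; i++)
		guest_write_entry(v, i, msgs[i]);
}

/* Order attributed to Linux above: program the whole table, then enable. */
void
linux_style_setup(struct vmsix *v, const uint32_t *msgs)
{
	for (unsigned i = 0; i < NVEC; i++)
		guest_write_entry(v, i, msgs[i]);
	guest_enable(v);
}

bool
shadow_in_sync(const struct vmsix *v)
{
	return (memcmp(v->guest, v->shadow, sizeof(v->guest)) == 0);
}
```

In this model a hypervisor with trap_writes_when_enabled set to false ends up with a stale shadow table after freebsd_style_setup() but a correct one after linux_style_setup(), which would match a guest-specific failure of the kind described above.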
This is a ping, we need to move forward on this issue.
(In reply to Rodney W. Grimes from comment #13)
Given that (a) there is absolutely no definitive background as to why the hack, pardon me, "quirk" was inserted in the first place, (b) utter disinterest in getting to the bottom of the hack from the kernel "folk," and (c) existence of viable alternatives, we can safely say that the hack these patches were trying to revert was a tombstone on FreeBSD as a passthrough guest---I, for one, just like in bug #178795, have moved on to Linux when I need passthrough; and I doubt we were the only two.
Now, given that I do like FreeBSD, I am quite happy to spin up an ESXi box and try whatever.
Incidentally, comment #12 doesn't explain why de-quirking worked fine, or why Linux/Windows guests are unaffected.
(In reply to igor from comment #14)
Comment #12 explains why another hypervisor had issues with FreeBSD due to an incomplete emulation of MSI-X. However, the questions it raises can only be answered by someone who knows how VMWare's hypervisor implements passthrough and how it has changed over time. For example, on Xen we employ a workaround for the known-broken versions (and Xen has been fixed). It may be that if VMWare had a similar issue it was previously fixed as well which would explain why the quirk may not be needed on newer versions of VMWare, but someone from VMWare would have to answer that question.
(In reply to John Baldwin from comment #15)
The reply from VMWare is in comment #3---their implementation was never broken. They also asked for specific bug reports to be identified where the buggy behaviour can be observed; yet, none were provided, and that was back in 2015!
Given that there doesn't seem to be a problem with MSI-X on Linux guests nor Windows guests, what does the application of Occam's razor yield?.. As for VMWare, how would anyone even begin to investigate an unexpected behaviour when that behaviour is only described as "doesn't work in some non-specific circumstances"?
I don't think the commenter in #3 understood what "broken" might mean in the context of MSI or MSI-X, so I attempted to give more detail on how a particular type of breakage might boot Linux ok but not FreeBSD (as was the case for Xen). However, if VMWare 6.0 is known to work we could alter the quirk to instead use the version of VMWare (e.g. via SMBIOS strings or what not) instead of PCI device IDs to know when to apply the quirk.
In terms of what VMware would investigate, I gave very specific details (do you implement an MSI-X shadow table, etc.) that someone familiar with the implementation would be able to investigate. This is not "not caring", this is "the best suggestions I can give of what to look for given what I know of how FreeBSD's use of MSI-X differs from Linux assuming I don't get to look at the sources myself".
(In reply to John Baldwin from comment #17)
In order to investigate anything that sporadically breaks, you need to know how to break it. Like I said in comment #6 when I disabled the hack, nothing was breaking; so why would anyone spend $$$ investigating anything without even a clue as to a path to break it? My interpretation of comment #3 is "Our implementation is in accordance with the spec. Do you have a specific scenario that takes our implementation outside of the spec?" To which nothing was produced.
Similarly, as bug #218113 shows MSI-X was also "hacked" for KVM (I don't know if that's still the case), so it would seem that according to FreeBSD neither KVM nor ESXi implement the hardware side of MSI-X properly? C'mon! :-)
VMware ESXi 5.5.0, 3248547 + IBM System x3550 M3 2xIntel Xeon X5670
MSI-X not working, with hw.pci.honor_msi_blacklist=0 - rebooting.
In the case of VMware 5.5, it's VMware vmx binary that's crashing. It doesn't crash on VMware 6.7. However, please note that VMware 5.5 is unsupported and hence you'd be expected to use 6.7.
Add me to the list of those using MSI-X without issues on ESXi, and also to the list of those annoyed at having to search for why I had no MSI-X in the default configuration.
The reasoning supplied for keeping it blacklisted is laughable; one wonders whether, if someone "sponsored" the request, it would be actioned, given that changelogs are full of "sponsored" patches nowadays.
To be blunt, even if the bug was legit, ESXi 5.x is ancient and obsolete. FreeBSD developers refuse to entertain bugs for unsupported builds of FreeBSD, so why are they entertaining this when it works fine on supported versions of ESXi?
I think a patch that only disabled MSI for <= 5 seems reasonable given that no one has reported it breaking on anything newer than that when setting the tunable. Comment 19 suggests 5.5 is still broken, so I think disabling it for all of 5.x by default seems prudent. I think that type of quirk would require removing the existing device-ID-based entry and instead adding a "manual" quirk that checks for VMWare (e.g. VM_GUEST_VMWARE) and then checks the major version. I'm not quite sure how to get the major version of VMWare, perhaps via the hypervisor cpuid leaf or the vmware_hvcall() function.
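The decision logic of such a version-based quirk might look roughly like the sketch below. It is only an outline: the enum stands in for the kernel's guest detection (e.g. VM_GUEST_VMWARE), how the major version would actually be obtained is left open exactly as in the comment above, and treating an unknown version as "keep the blacklist" is an assumption made here, not something settled in this thread.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's hypervisor detection result. */
enum guest_kind { GUEST_NONE, GUEST_VMWARE, GUEST_OTHER };

/*
 * Decide whether MSI-X should be blacklisted by default.
 * esxi_major is the hypervisor's major version, or -1 if it could not
 * be determined (assumption: stay conservative in that case).
 */
bool
msix_blacklisted(enum guest_kind guest, int esxi_major)
{
	if (guest != GUEST_VMWARE)
		return (false);
	if (esxi_major < 0)
		return (true);
	/* Disable for all of 5.x and older, allow on anything newer. */
	return (esxi_major <= 5);
}
```

The unknown-version branch is the part most open to debate; flipping it to return false would instead favor currently supported ESXi releases by default.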
It seems a much more straightforward option would be to require those who are using archaic hypervisors to manually set a kernel variable (like kern.disable_msix_on_legacy_hypervisor=1) via loader.conf and to add an appropriate note to the release/updating documentation?
I would be OK with a 5.x-only blacklist; the problem is that there seems to be no way to identify it, as apparently the PCI IDs are the same.
So I think there should be two changes.
1 - For "all" hardware, if its on a blacklist, add something to DMESG that tells the end user something like "MSIX disabled due to blacklist, can disable via this bootloader"
As I think silently blocking a function is the wrong way to go about things.
2 - The default behaviour should be to suit the current supported builds of vmware so default to no blacklist, but keep an option so those using unsupported versions can still use the OS. But if a way could be found to differentiate between new and old vmware then yeah keep a blacklist for old vmware.
Also another point of concern, so we know at this point the kernel developer that replied here has vanished.
The vmware developers have written a pvscsi driver for FreeBSD, that driver is suffering the same issues as this bug report.
It has sat in review status for nearly a year. Initially the VMware developer who submitted it was available for responses, but I expect they have now given up, so this issue of things being left in limbo is holding back FreeBSD's virtualization capability.
This bug report was submitted 4 years ago; there does seem to be some kind of issue with the approach to this kind of thing.
I just said how to check the version in the previous comment. So, no, it is not true that we can't handle versions because the PCI ID doesn't change. As I said in the previous comment, the fix is to make it not be a PCI ID-based quirk. There are published docs that the source code references (e.g. sys/x86/x86/tsc.c and sys/x86/include/vmware.h) on how the hypercalls and leaf functions work, so the patch is not impossible, but someone will have to do some work to write a patch and test it.
(In reply to John Baldwin from comment #25)
My suggestion is much simpler: remove support for platform that is deprecated by vendor! If someone (is there anyone?) still is on ESXi 5.x then provide them with a loader switch to disable it with a view of deprecating support for ESXi 5.x branch altogether in next major release---why on earth would you spend time and money supporting anything that has been dropped by vendor a long time ago while paralysing the currently supported platforms?
I am ok with both proposals.
Certainly I think the current approach is wrong, and should be changed.
It is a bit like deciding to block the use of the math coprocessor on all modern CPUs just because the legacy Pentium chip had a bug in it.