Summary: | [patch] MSI/MSI-X interrupts don't work in VMware virtual machines | ||||||
---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | bhavesh | ||||
Component: | kern | Assignee: | freebsd-virtualization (Nobody) <virtualization> | ||||
Status: | Closed FIXED | ||||||
Severity: | Affects Many People | CC: | arunpereira, chrysalis, dmitry.wagin, heoj, igor, jhb, marius, mp, mqudsi, ncrogers, pi, registrazioni, rgrimes | ||||
Priority: | --- | Keywords: | patch | ||||
Version: | CURRENT | ||||||
Hardware: | amd64 | ||||||
OS: | Any | ||||||
Bug Depends on: | |||||||
Bug Blocks: | 178795 | ||||||
Attachments: |
|
Description
bhavesh
2015-10-19 16:32:06 UTC
Created attachment 162212 [details]
patch to remove VMware MSI/MSI-X quirk
Removing the ESXi MSI-X quirk altogether simply is way too gross as the intention is to provide a setup which at least works out of the box, also on older versions of the the VMM and even if that doesn't yield the optimal configuration. As you apparently work for VMware, could you please get the revision ID in the PCI header of 0x079015ad and 0x07a015ad bridges bumped for the implementations that have MSI-X allocation fixed - if that hasn't already be done, yet -, so we can distinguish the good from the bad ones in the PCI code? My logs indicate that at least the bridges reporting revision 0x1 were affected. (In reply to Marius Strobl from comment #2) I tried to search for the history behind why this quirk was introduced for VMware virtual machines in the first place, as we've never shipped a version of ESXi where the implementation of MSI or MSI-X was broken, like the comment suggests it was in ESXi 5.1 and before. So there have been no bumps to the PCI revision IDs of these devices that can identify when it was broken v/s fixed as from our POV it was never broken. Can you point me to some bug report or mailing list archive or something that details what was supposedly broken w.r.t. "MSI-X allocation on ESXi?" (In reply to bhavesh from comment #3) The problem with broken versions of ESXi is that allocation of MSI-X vectors fails when tried for the first time but succeeds on subsequent attempts. As Jack Vogel from Intel pointed out, this behavior inherently reminds of a design bug in Linux/KVM: https://lists.freebsd.org/pipermail/freebsd-stable/2012-October/070220.html The full thread starts here: https://lists.freebsd.org/pipermail/freebsd-stable/2012-October/070196.html I think at that time igb(4) additionally had an orthogonal and meanwhile fixed bug, causing it to fail to properly fall back to MSI or even INTx, leading to an overall non-working default setup when run under ESXi back then. Apart from that, there was a private and independent report to jhb@freebsd.org and me in June 2013, indicating that using INTx with ESXi 5.1 causes interrupt storms on the FreeBSD side. In addition to the VMware documentation of that time listing only three devices known-working with MSI-X in pass-through mode, this lead to the conclusion that MSIs are the only thing that reliably works with ESXi. Consequently, the current behavior of blacklisting the VMware PCI bridges for MSI-X (but contrary to your claim, not for MSI) was implemented and verified to yield a configuration which works out-of-the-box in the known problematic setups involving ESXi. As for the MSI-X allocation bug, it likely isn't/wasn't present in the bridge code of ESXi itself but rather in its LAPIC emulation. Still, bumping the PCI revision IDs to indicate ESXi versions where that problem has been fixed would be way preferable to having to query the hypervisor version by other means in the machine-independant PCI code of FreeBSD for limiting the blacklist entries for VMware. (In reply to Marius Strobl from comment #4) Thanks for the additional background. I had not found those in my searching. It seems https://lists.freebsd.org/pipermail/freebsd-stable/2012-October/070220.html is being used as the authoritative reasoning behind blacklisting VMware VMs (via the PCI bridge and PCIe root port VID/DID) for lack of MSI-X support. Unfortunately that e-mail/thread doesn't give more details than: --quote-- When my driver would first attempt to get some vectors the Linux code would go look at the vector table, but the way the PCI code in FreeBSD works that table is not set up yet, so Linux would see no legitimate vectors in the table, and decide the guest was ineligible for MSIX :(, but by this time the FreeBSD code actually DID write some vectors into the table, and thus when you load the driver the SECOND time Linux would see the table populated and TADA!! would enable the guest to use MSIX. Now, I went thru a bunch of efforts via our Linux team here to have the KVM code fixed, its design was bogus, and I believe it has been, but it sounds like maybe VMWare has the same broken design?? --end-quote-- Need more details, like why the author of that e-mail believes "its design was bogus" and can he mention the GIT commit or a link to the discussion where it was supposedly "fixed" in KVM? I'm Cc'ing Jack directly. Hopefully the e-mail address is still valid. Thanks. (In reply to bhavesh from comment #5) Is there still a good idea on to keep this blacklisting for ESXi, other than exposing bugs in fall-back code in drivers (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218113)? It's been nearly a year and a half since the last comment? I've removed the blacklisting for 11-RELEASE on my box inside ESXi 6.5 and there doesn't seem to be any previously reported signs of "badness" with LSI SAS emulation and igb(4) passthrough so far. This needs to be regression tested as far back as ESXi 5.0 imho, as that is still a common production environment. It would be nice to have a fix that just did the right thing on any of ESXi 5.0 to 6.5. It would also be nice to have a fix that works on FreeBSD 10, 11 and -current, and even nicer if that fix could back port even further back. (In reply to Rodney W. Grimes from comment #7) ESXi 5.0 and 5.1 had their general support discontinued on 24th August 2016: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2145103 Do you have any empirical evidence to suggest that 5.0 is "is still a common production environment," or is this more of a "Windows XP"-style "let's just use it in a hope it doesn't break" type of thing? Further, as bhavesh@vmware.com pointed out ESXi was never broken, and there is hardly anything but speculative rumours to support the original inclusion (not even a reproducible bug report!) One thing that came to my mind regarding some people apparently not being able to reproduce the original problem is that it may also depend on the virtualization capabilities of the hardware used, e. g. APICv, AVIC, MSI/MSI-X remapping etc., i. e. the bug may be present in the full software emulation of ESXi (or some versions thereof) but hidden when interrupt handling is assisted by hardware. (In reply to Marius Strobl from comment #9) And without any bug reports that lead to the exclusion of MSI-X support under FreeBSD we'll never know. Linux has MSI-X under ESXi, if I recall correctly, so chances are that it's a "FreeBSD problem" not an ESXi one. Hi, This bug has been around quite a while. Is there anything that VMware needs to do to resolve this issue? Thanks, Jin Hmm, I don't recall which issue we ran into with VMWare. We have had some issues with Xen hypervisors that might be related. They had to do with how the hypervisor cached MSI-X table entries and that it didn't notice updates to the table that FreeBSD's kernel performed because it assumed the table was not modified once MSI-X was enabled in the capability register. Commits related to those issues are r302181 + r302635 and r310806. From the thread about igb(4) that was quoted, it sounds like it might be a similar issue where VMWare might be assuming that MSI-X tables are only updated in a particular order. FreeBSD follows a scheme where it marks all the table entries as disabled and then enables MSI-X in the capability register and then writes individual table entries as the driver calls bus_setup_intr(). This means that the table can change while MSI-X is enabled. A table entry might also be written multiple times if a device driver moves an interrupt to a different CPU than the one chosen by the system. In real hardware this works fine as the hardware checks the table to compose the MSI message before sending each message. Hypervisors might choose to cache the table since reading the actual table might be too expensive. However, in that case hypervisors have to trap all writes to the table and update their cached copy for each write. The Xen issues had to do with Xen assuming it could ignore writes made while MSI-X was active since Linux writes the whole table and then enables MSI-X. I would start by seeing if you have a shadow copy of the MSI-X table and if you make assumptions about which writes to that table have to be trapped to update the shadow table. This is a ping, we need to move forward on this issue. (In reply to Rodney W. Grimes from comment #13) Given that (a) there is absolutely no definitive background as to why the hack, pardon me, "quirk" was inserted in in the first place, (b) utter disinterest in getting to the bottom of the hack from the kernel "folk," and (c) existence of viable alternatives, we can safely say that the hack these patches were trying to revert was a tombstone on FreeBSD as a passthrough guest---I, for one, just like in bug #178795, have moved onto Linux when I need passthrough; and I doubt we were the only two. Now, given that I do like FreeBSD, I am quite happy to spin up an ESXi box and try whatever. Incidentally, comment #12 doesn't explain why de-quirking worked fine, or why Linux/Windows guests are unaffected. (In reply to igor from comment #14) Comment #12 explains why another hypervisor had issues with FreeBSD due to an incomplete emulation of MSI-X. However, the questions it raises can only be answered by someone who knows how VMWare's hypervisor implements passthrough and how it has changed over time. For example, on Xen we employ a workaround for the known-broken versions (and Xen has been fixed). It may be that if VMWare had a similar issue it was previously fixed as well which would explain why the quirk may not be needed on newer versions of VMWare, but someone from VMWare would have to answer that question. (In reply to John Baldwin from comment #15) The reply from VMWare is in comment #3---their implementation was never broken. They also asked for specific bug reports to be identified where the buggy behaviour can be observed; yet, none were provided, and that was back in 2015! Given that there doesn't seem to be a problem with MSI-X on Linux guests nor Windows guests, what does the application of Occam's razor yield?.. As for VMWare, how would anyone even begin to investigate an unexpected behaviour when that behaviour is only described as "doesn't work in some non-specific circumstances"? I don't think the commenter in #3 understood what "broken" might mean in the context of MSI or MSI-X, so I attempted to give more detail on how a particular type of breakage might boot Linux ok but not FreeBSD (as was the case for Xen). However, if VMWare 6.0 is known to work we could alter the quirk to instead use the version of VMWare (e.g. via SMBIOS strings or what not) instead of PCI device IDs to know when to apply the quirk. In terms of what VMware would investigate, I gave very specific details (do you implement an MSI-X shadow table, etc.) that someone familiar with the implementation would be able to investigate. This is not "not caring", this is "the best suggestions I can give of what to look for given what I know of how FreeBSD's use of MSI-X differs from Linux assuming I don't get to look at the sources myself". (In reply to John Baldwin from comment #17) In order to investigate anything that sporadically breaks, you need to know how to break it. Like I said in comment #6 when I disabled the hack, nothing was breaking; so why would anyone spend $$$ investigating anything without even a clue as to a path to break it? My interpretation of comment #3 is "Our implementation is in accordance with the spec. Do you have a specific scenario that takes our implementation outside of the spec?" To which nothing was produced. Similarly, as bug #218113 shows MSI-X was also "hacked" for KVM (I don't know if that's still the case), so it would seem that according to FreeBSD neither KVM nor ESXi implement the hardware side of MSI-X properly? C'mon! :-) WMware ESXi 5.5.0, 3248547 + IBM System x3550 M3 2xIntel Xeon X5670 MSI-X not working, with hw.pci.honor_msi_blacklist=0 - rebooting. In the case of VMware 5.5, it's VMware vmx binary that's crashing. It doesn't crash on VMware 6.7. However, please note that VMware 5.5 is unsupported and hence you'd be expected to use 6.7. Add me to the list of using MSIX without issues on ESXi, Also on the list of been annoyed having to search for as to why I had no MSIX in default configuration. The reasoning supplied for keeping it blacklisted is laughable, one wonders if someone "sponsored" the request maybe it would be actioned given changelog's are full of "sponsored" patches now days. To be blunt even if the bug was legit, ESXi 5.x is ancient and obsolete, FreeBSD developers refuse to entertain bugs for unsupported builds of FreeBSD, so why are they entertaining this when it works fine on supported versions of ESXi. I think a patch that only disabled MSI for <= 5 seems reasonable given that no one has reported it breaking on anything newer than that when setting the tunable. Comment 19 suggests 5.5 is still broken, so I think disabling it for all of 5.x by default seems prudent. I think that type of quirk would require removing the existing device-ID-based entry and instead adding a "manual" quirk that checks for VMWare (e.g. VM_GUEST_VMWARE) and then checks the major version. I'm not quite sure how to get the major version of VMWare, perhaps via the hypervisor cpuid leaf or the vmware_hvcall() function. Seems a much more straight forward option would be to require those who are using archaic hypervisors to manually set a kernel variable (like kern.disable_msix_on_legacy_hypervisor=1) via loader.conf and add appropriate note to release/updating documentation? I would be ok with a 5.x only blacklist, the problem is I think there is no way to seemingly identify it as apparently the pci id's are the same. So I think should be two changes. 1 - For "all" hardware, if its on a blacklist, add something to DMESG that tells the end user something like "MSIX disabled due to blacklist, can disable via this bootloader" As I think silently blocking a function is the wrong way to go about things. 2 - The default behaviour should be to suit the current supported builds of vmware so default to no blacklist, but keep an option so those using unsupported versions can still use the OS. But if a way could be found to differentiate between new and old vmware then yeah keep a blacklist for old vmware. Also another point of concern, so we know at this point the kernel developer that replied here has vanished. The vmware developers have written a pvscsi driver for FreeBSD, that driver is suffering the same issues as this bug report. It has sat in review status for nearly a year. Initially the vmware developer who submitted it was available for responses but I expect has now given up so this issue of things been left in limbo status is holding back FreeBSD virtualization capability. This bug report was submitted 4 years ago, there does seem to be some kind of issue with approach to this kind of thing. I just said how to check the version in the previous comment. So, no, it is not true that we can't handle versions because the PCI ID doesn't change. As I said in the previous comment, the fix is to make it not be a PCI ID-based quirk. There are published docs that the source code references (e.g. sys/x86/x86/tsc.c and sys/x86/include/vmware.h) on how the hypercalls and leaf functions work, so the patch is not impossible, but someone will have to do some work to write a patch and test it. (In reply to John Baldwin from comment #25) My suggestion is much simpler: remove support for platform that is deprecated by vendor! If someone (is there anyone?) still is on ESXi 5.x then provide them with a loader switch to disable it with a view of deprecating support for ESXi 5.x branch altogether in next major release---why on earth would you spend time and money supporting anything that has been dropped by vendor a long time ago while paralysing the currently supported platforms? I am ok with both proposals. Certainly I think the current approach is wrong, and should be changed. It is a bit like deciding to block the use of math co processor on all modern cpu's just because the legacy pentium chip had a bug on it. Hopefully this isn't too late but I know there were some issues caused with pass-through PCI-e devices and MSI-X support under ESX 5.5+, but I cannot for the life of me find the issue that I know exists somewhere in bugzilla (either this one or on the FreeNAS one). This is not the issue I was searching for (which is one that I've commented on), but here is a report that was originally filed against a FreeBSD system running under ESX 5.1 (not pre-5.0 as claimed in some of the comments above): https://redmine.ixsystems.com/issues/2056 This must have been before the hack was added to pci.c because the fix back then was to globally set hw.pci.enable_msix=0 (or else individually blacklist the LSI device). I personally ran into interrupt storms with MSI-X enabled under 6.0 and a LSI 9211-8i a few years back, which is how I first came across this back in the day. Googling for some variation of `esxi, freebsd, mps, interrupt storm, lsi, pass-through, vmware` brings up a lot of relevant results, almost all of which are post ESX 5.0 I am reminded heavily of Chesterton's fence here and would recommend proceeding only with extreme caution (and after looking into LSI HBA passthrough configurations and scenarios). I looked into this issue a bit to better understand the lineage. There was a report of issues for FreeBSD in ESXi 5.0 back in late 2011: https://forums.freebsd.org/threads/freebsd-9-rc2-breaks-mpt-driver-when-used-with-vmware-and-pcie-passthrough.27899/ And the blacklist quirk was added here in 2013-03: https://svnweb.freebsd.org/base/head/sys/dev/pci/pci.c?revision=247632&view=markup I could not find a bug report for the above FreeBSD issue within the VMware bug tracker. But it did show up as a Linux 4.3+ issue in 2016-01 reported against ESXi 6.0. The issue and resolution was documented in this knowledge base article which lists fixes in ESXi 5.5 patch 08 and 6.0 patch 03 which came out in 2016. https://kb.vmware.com/s/article/2150585 I will note that Comment #19 indicates it is broken in ESXi 5.5 build 3248547 which appears to be in the patch 07 branch prior to this fix being available. Now, let’s look at EOL dates: vSphere 5.0 and 5.1 EOL - August 24, 2016 vSphere 5.5 EOL – September 19, 2018 vSphere 6.0 EOL – March 2020 There have been patch releases made available in 2016 to address both 5.5 and 6.0, plus having 5.5 already EOL and 6.0 in a few months. Given that it is my recommendation we 1) remove the blacklist for these devices and 2) not investigate or add version checking (which likely would be error prone and would not take into account patch releases). I just proposed a patch to remove the quirk which I can commit with approvals: https://reviews.freebsd.org/D22819 A commit references this bug: Author: mp Date: Wed Dec 18 23:00:56 UTC 2019 New revision: 355889 URL: https://svnweb.freebsd.org/changeset/base/355889 Log: Remove VMware MSI-X from the PCI blacklist. First reported against ESXi 5.0, PCI passthrough was not working due to MSI-X issues. However, this issue was fixed via patch releases against ESXi 5.5 and 6.0 in 2016. Given ESXi 5.5 and earlier have been EOL, this patch removes the VMware MSI-X blacklist entries in the quirk table. PR: 203874 Reviewed by: imp, jhb MFC after: 1 month Sponsored by: VMware Differential Revision: https://reviews.freebsd.org/D22819 Changes: head/sys/dev/pci/pci.c Thanks for the digging Mark. It is amusing that the description of the bug in VMware matches almost exactly the bug Xen had. I only see it in the master branch on FreeBSD i think that it would be useful to see it also for FreeBSD 12 if possible Thanks, Manuel |