Created attachment 178755 [details]
Screenshot Terminal SSH

Hi, I have a problem with bhyve: when using more than 1 CPU, bhyve crashes. I am trying Windows 10 x64 on an AMD A8 7600, ASRock FM2A88X, 32GB DDR3-1600, 500GB disk. FreeBSD 11 was updated today before this attempt, and bhyve-firmware was also updated to the latest version. It works only with 1 CPU...

Using to boot:

sudo bhyve -c 2 -m 4G -H -w \
  -s 0,hostbridge \
  -s 3,ahci-cd,virtio-win-0.1.126.iso \
  -s 4,ahci-hd,win10.img \
  -s 5,virtio-net,tap10 \
  -s 29,fbuf,tcp=0.0.0.0:5900,wait \
  -s 30,xhci,tablet \
  -s 31,lpc \
  -l com1,stdio \
  -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd \
  win10

It only works when changing '-c 2' to '-c 1'. I tried the same on an Intel i5 and it worked with 2 vCPUs, so it looks like this is an AMD-related problem.
I also added the following lines to /boot/loader.conf:

hw.vmm.topology.cores_per_package=4
hw.vmm.topology.threads_per_core=4

but I can still only install/boot Windows under bhyve with 1 CPU; more than 1 freezes the VM.
The workaround is to install with 1 vCPU, and then increase the count post-install. I can reproduce this. It looks like this needs some quality time in the Windows debugger to see where the CPUs start to spin.
(In reply to Peter Grehan from comment #2)
I tried starting with 2 vCPUs after the install, but it hangs at the Windows 10 x64 start screen.
You have to wait until the install is complete (i.e. the 3rd reboot, where you enter username etc). At that point, you should be able to power off and then restart with > 1 vCPU.
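Concretely, the sequence is just the original command run twice with a different '-c' value - a minimal sketch reusing the exact devices and paths from the original report:

# Install phase: single vCPU until Windows setup fully completes
sudo bhyve -c 1 -m 4G -H -w \
  -s 0,hostbridge \
  -s 3,ahci-cd,virtio-win-0.1.126.iso \
  -s 4,ahci-hd,win10.img \
  -s 5,virtio-net,tap10 \
  -s 29,fbuf,tcp=0.0.0.0:5900,wait \
  -s 30,xhci,tablet \
  -s 31,lpc \
  -l com1,stdio \
  -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd \
  win10

# After the final setup reboot (account created, desktop reached):
# power off the guest, then relaunch the same command with '-c 2'.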
(In reply to Peter Grehan from comment #4)
I did that. With 1 vCPU:
- install, reboot
- set up Windows, reboot
- install virtio driver, reboot

When everything was set up, I booted with 2 vCPUs, but then it freezes... With 1 vCPU it does boot. I can try again, but I tried with both FreeBSD 12 and 11 and the same thing happens.
I didn't add the virtio driver - maybe that was what did it. Also, I'm installing on an Opteron 6320. The process was, with 1 vCPU:
- install, reboot
- 2nd phase, reboot
- final phase (set up account, etc.; goes to desktop), reboot

Then restart with multiple vCPUs. I tried 2, and also 6 after setting hw.vmm.topology.cores_per_package.
(In reply to Peter Grehan from comment #6)
I did it almost the same way, only with the virtio driver, and tried 2 and 4 vCPUs. I also added hw.vmm.topology.cores_per_package="4" to /boot/loader.conf. I really have no idea - maybe it's something about the A8 that bhyve does not like?
Same behaviour here on a Ryzen 1700 and "FreeBSD 12.0-CURRENT #0 334829e6c(drm-next)-dirty". Setting the vCPU count greater than 1 leads to random lock-ups of the Windows 10 VM. Two, sometimes three of the vCPUs are creating 100% load on the host system. Keyboard input via VNC doesn't work at all.

"bhyve" itself writes:
-------------------------------------------------------------------------------
fbuf frame buffer base: 0xa43200000 [sz 16777216]
rdmsr to register 0xc0010114 on vcpu 0
rdmsr to register 0xc0010114 on vcpu 1
wrmsr to register 0x10(0) on vcpu 1
rdmsr to register 0xc0010114 on vcpu 2
wrmsr to register 0x10(0) on vcpu 2
rdmsr to register 0xc0010114 on vcpu 3
wrmsr to register 0x10(0) on vcpu 3
wrmsr to register 0x10(0xcc75fcd2078) on vcpu 3
wrmsr to register 0x10(0xcc75fcd2078) on vcpu 0
wrmsr to register 0x10(0xcc75fcd2078) on vcpu 1
wrmsr to register 0x10(0xcc75fcd2078) on vcpu 2
atkbd data buffer full
atkbd data buffer full
atkbd data buffer full
atkbd data buffer full
atkbd data buffer full
atkbd data buffer full
atkbd data buffer full
atkbd data buffer full
atkbd data buffer full
atkbd data buffer full
atkbd data buffer full
atkbd data buffer full
atkbd data buffer full
-------------------------------------------------------------------------------

sysctls:
-------------------------------------------------------------------------------
#sysctl hw.vmm
hw.vmm.npt.pmap_flags: 507
hw.vmm.svm.num_asids: 32768
hw.vmm.svm.disable_npf_assist: 0
hw.vmm.svm.features: 113919
hw.vmm.svm.vmcb_clean: 959
hw.vmm.vmx.vpid_alloc_failed: 0
hw.vmm.vmx.posted_interrupt_vector: -1
hw.vmm.vmx.cap.posted_interrupts: 0
hw.vmm.vmx.cap.virtual_interrupt_delivery: 0
hw.vmm.vmx.cap.invpcid: 0
hw.vmm.vmx.cap.monitor_trap: 0
hw.vmm.vmx.cap.unrestricted_guest: 0
hw.vmm.vmx.cap.pause_exit: 0
hw.vmm.vmx.cap.halt_exit: 0
hw.vmm.vmx.initialized: 0
hw.vmm.vmx.cr4_zeros_mask: 0
hw.vmm.vmx.cr4_ones_mask: 0
hw.vmm.vmx.cr0_zeros_mask: 0
hw.vmm.vmx.cr0_ones_mask: 0
hw.vmm.ept.pmap_flags: 0
hw.vmm.vrtc.flag_broken_time: 1
hw.vmm.ppt.devices: 0
hw.vmm.iommu.enable: 1
hw.vmm.iommu.initialized: 0
hw.vmm.bhyve_xcpuids: 136
hw.vmm.topology.cpuid_leaf_b: 1
hw.vmm.topology.cores_per_package: 4
hw.vmm.topology.threads_per_core: 1
hw.vmm.create: beavis
hw.vmm.destroy: beavis
hw.vmm.trace_guest_exceptions: 0
hw.vmm.ipinum: 251
hw.vmm.halt_detection: 1
-------------------------------------------------------------------------------

started "bhyve" with:
-------------------------------------------------------------------------------
bhyve -c 4 -m 8G \
  -w -H -A -P \
  -s 0,amd_hostbridge \
  -s 1,lpc \
  -s 2,ahci-cd,/mnt/ryzen/iso/Windows10-PRO.de.iso \
  -s 3,ahci-hd,/mnt/ryzen/vms/${NAME}/lun0.img \
  -s 9,e1000,tap${ID} \
  -s 29,fbuf,tcp=0.0.0.0:5901,w=1024,h=768,wait \
  -s 30,xhci,tablet \
  -l com1,/dev/nmdm0A \
  -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd \
  ${NAME}
-------------------------------------------------------------------------------

looking at the "ktrace -p" file, I see lots of:
-------------------------------------------------------------------------------
[...]
3826 vcpu 0 CALL  ioctl(0x3,0xc0907601,0x7fffddbebe30)
3826 vcpu 3 RET   ioctl 0
3826 vcpu 0 RET   ioctl 0
3826 vcpu 0 CALL  ioctl(0x3,0xc0907601,0x7fffddbebe30)
3826 vcpu 3 CALL  ioctl(0x3,0xc0907601,0x7fffdd3e7e30)
3826 vcpu 2 CALL  ioctl(0x3,0xc0907601,0x7fffdd5e8e30)
3826 vcpu 3 RET   ioctl 0
3826 vcpu 3 CALL  ioctl(0x3,0xc0907601,0x7fffdd3e7e30)
3826 vcpu 2 RET   ioctl 0
3826 vcpu 3 RET   ioctl 0
3826 vcpu 3 CALL  ioctl(0x3,0xc0907601,0x7fffdd3e7e30)
3826 vcpu 0 RET   ioctl 0
3826 vcpu 3 RET   ioctl 0
3826 vcpu 3 CALL  ioctl(0x3,0xc0907601,0x7fffdd3e7e30)
3826 vcpu 0 CALL  ioctl(0x3,0xc0907601,0x7fffddbebe30)
3826 vcpu 3 RET   ioctl 0
3826 vcpu 0 RET   ioctl 0
3826 vcpu 2 CALL  ioctl(0x3,0xc0907601,0x7fffdd5e8e30)
3826 vcpu 2 RET   ioctl 0
3826 vcpu 3 CALL  ioctl(0x3,0xc0907601,0x7fffdd3e7e30)
3826 vcpu 3 RET   ioctl 0
3826 vcpu 0 CALL  ioctl(0x3,0xc0907601,0x7fffddbebe30)
3826 vcpu 2 CALL  ioctl(0x3,0xc0907601,0x7fffdd5e8e30)
3826 vcpu 2 RET   ioctl 0
[...]
-------------------------------------------------------------------------------

Anything I can do to help debugging?
Insta-repro for me on a Ryzen 1700. It happens almost immediately on install with >= 2 vCPUs, and the more vCPUs configured, the faster the freeze. A single-vCPU install is reliable, and I've been able to get occasional long uptimes with server SKUs and 2 vCPUs. I also see cases where only some vCPUs are stuck at 100% - sometimes 2, with the remainder idle. The RIPs of the spinning vCPUs are generally constant, indicating a lock-spin or similar.

To debug further with Windows, it probably needs the Windows kernel debugger to be hooked up, and then trapped into once the spin is seen. However, I can repro this doing a FreeBSD buildworld with >= 12 vCPUs. It takes a lot longer (~20 mins) but seems to be reliable. Backtraces in ddb seem to show a missed IPI while holding a spinlock, which eventually blocks the entire system.
Peter Grehan wrote in comment #9:
> However, I can repro this doing a FreeBSD buildworld with >= 12 vCPUs. It
> takes a lot longer (~20 mins) but seems to be reliable. Backtraces in ddb
> seem to show a missed IPI while holding a spinlock, which eventually blocks
> the entire system.

Is that ddb from within the guest VM or on the host?
It's ddb from within the guest. The signature is: one vCPU will panic with a lock-spin timeout:

CPU 11, panic
spin lock 0xffffffff81ea0480 (smp rendezvous) held by 0xfffff800079da000 (tid 100093) too long
vpanic() at vpanic+0x1b9/frame 0xfffffe02ba76f6f0
panic() at panic+0x43/frame 0xfffffe02ba76f750
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0x328/frame 0xfffffe02ba76f7d0
__mtx_lock_spin_flags() at __mtx_lock_spin_flags+0xe0/frame 0xfffffe02ba76f810
smp_rendezvous_cpus() at smp_rendezvous_cpus+0xab/frame 0xfffffe02ba76f880
dtrace_sync() at dtrace_sync+0x77/frame 0xfffffe02ba76f8d0
dtrace_state_deadman() at dtrace_state_deadman+0x13/frame 0xfffffe02ba76f900

That spinlock is held by another vCPU that is waiting for an ack to its IPI:

CPU 5
--- trap 0x13, rip = 0xffffffff81033ac2, rsp = 0xfffffe02c8009860, rbp = 0xfffffe02c80098d0 ---
smp_targeted_tlb_shootdown() at smp_targeted_tlb_shootdown+0x352/frame 0xfffffe02c80098d0
smp_masked_invlpg() at smp_masked_invlpg+0x4c/frame 0xfffffe02c8009900
pmap_invalidate_page() at pmap_invalidate_page+0x191/frame 0xfffffe02c8009950
pmap_ts_referenced() at pmap_ts_referenced+0x7b3/frame 0xfffffe02c8009a00
vm_pageout() at vm_pageout+0xe04/frame 0xfffffe02c8009a70

...and all other vCPUs are waiting on the lock held by the vCPU awaiting the ack:

--- trap 0x13, rip = 0xffffffff80a8d222, rsp = 0xfffffe02c8349600, rbp = 0xfffffe02c8349610 ---
lock_delay() at lock_delay+0x42/frame 0xfffffe02c8349610
__mtx_lock_sleep() at __mtx_lock_sleep+0x228/frame 0xfffffe02c83496a0
__mtx_lock_flags() at __mtx_lock_flags+0xe8/frame 0xfffffe02c83496f0
vm_page_enqueue() at vm_page_enqueue+0x6b/frame 0xfffffe02c8349720
vm_fault_hold() at vm_fault_hold+0x1ab9/frame 0xfffffe02c8349850
vm_fault() at vm_fault+0x75/frame 0xfffffe02c8349890
I had this vCPU lock-up behaviour on a "Phenom II X6 1055T", too. So it seems that the desktop lines of AMD CPUs are generally unsupported in bhyve's SVM implementation. OK, while studying https://en.wikipedia.org/wiki/Inter-processor_interrupt: is there anything I can check/debug here on my system? I have no idea how to remotely kernel-debug Windows...
I appear to be running into the same problem under different circumstances. I am running a Windows 2012R2 VM with a little help from chyves. It works perfectly well for 3-4 days, idling at about 1% CPU on my Xeon E5-2630v3. Then the VM goes unresponsive and bhyve starts consuming ~100% of a core, regardless of the number of vCPUs assigned to the VM. I've tested this with both one and four cores assigned to the VM. One crash filled the screen with "atkbd data buffer full", but most don't. The VNC console is blank and unresponsive.

==================================================
Platform:
==================================================
White box server
Motherboard: Asrock X99/Extreme4
CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (2399.35-MHz K8-class CPU)
RAM: 8x 16 GB ECC (128GB)
Intel NIC, IBM/LSI HBA, a few other odds and ends I doubt would make much difference

===================================================
top reports one of 16 (8, hyperthreaded) cores in use:
===================================================
last pid: 41496;  load averages: 1.12, 1.13, 1.09    up 44+19:40:00  17:55:06
59 processes:  1 running, 58 sleeping
CPU:  0.0% user,  0.0% nice,  6.2% system,  0.0% interrupt, 93.8% idle
Mem: 12M Active, 1281M Inact, 121G Wired, 2684M Free
ARC: 92G Total, 39G MFU, 50G MRU, 300K Anon, 787M Header, 2850M Other
Swap:

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
29470 root       23  20    0 17482M  6750M kqread  3  21.9H 101.21% bhyve

==================================================
root@chef:~ # uname -a
FreeBSD chef.bofh 11.0-STABLE FreeBSD 11.0-STABLE #0: Fri Mar 3 04:28:46 CET 2017 root@chef.bofh:/usr/obj/usr/src/sys/CHEF amd64

root@chef:~ # chyves ike get all
Getting all ike's properties...
bargs                                 -A -H -P -S
bhyve_disk_type                       ahci-hd
bhyve_net_type                        e1000
bhyveload_flags
chyves_guest_version                  0300
cpu                                   4
creation                              Created on Fri Mar 24 20:26:53 CET 2017 by chyves v0.2.0 2016/09/11 using __create()
description                           -
eject_iso_on_n_reboot                 3
loader                                uefi
net_ifaces                            tap51
notes                                 -
os                                    windows
ram                                   16G
rcboot                                0
revert_to_snapshot
revert_to_snapshot_method             off
serial                                nmdm51
template                              no
uefi_console_output                   vnc
uefi_firmware                         BHYVE_UEFI.fd
uefi_vnc_client                       print
uefi_vnc_client_custom_cmd
uefi_vnc_ip                           0.0.0.0
uefi_vnc_mouse_type                   usb3
uefi_vnc_pause_until_client_connect   no
uefi_vnc_port                         5901
uefi_vnc_res                          800x600
uuid                                  d5302114-10c7-11e7-91c6-d05099803cdc

==================================================
I get the same kdump output as Nils Beyer:
==================================================
29470 vcpu 1 CALL  ioctl(0x3,0xc0907601,0x7fffdd9eae30)
29470 vcpu 2 CALL  ioctl(0x3,0xc0907601,0x7fffdd7e9e30)
29470 vcpu 3 CALL  ioctl(0x3,0xc0907601,0x7fffdd5e8e30)
29470 vcpu 1 RET   ioctl 0
29470 vcpu 3 RET   ioctl 0
29470 vcpu 1 CALL  ioctl(0x3,0xc0907601,0x7fffdd9eae30)
29470 vcpu 3 CALL  ioctl(0x3,0xc0907601,0x7fffdd5e8e30)
29470 vcpu 2 RET   ioctl 0
29470 vcpu 2 CALL  ioctl(0x3,0xc0907601,0x7fffdd7e9e30)
29470 vcpu 1 RET   ioctl 0
29470 vcpu 1 CALL  ioctl(0x3,0xc0907601,0x7fffdd9eae30)
29470 vcpu 3 RET   ioctl 0
29470 vcpu 3 CALL  ioctl(0x3,0xc0907601,0x7fffdd5e8e30)
29470 vcpu 2 RET   ioctl 0
29470 vcpu 2 CALL  ioctl(0x3,0xc0907601,0x7fffdd7e9e30)
29470 vcpu 0 RET   ioctl 0
29470 vcpu 0 CALL  ioctl(0x3,0xc0907601,0x7fffddbebe30)
> bhyve_net_type e1000

The lockup you are seeing is unrelated to the AMD one; it is a known issue with the e1000 device under Windows. I've created bug 218715 to track the e1000 issue.
(In reply to Peter Grehan from comment #11)
Peter, do you have any news regarding this issue? The guest freezes still happen on 11.1-RELEASE. Sometimes the Windows 10 guest boots and I can log in, but then it freezes after some time (all vcores 100% loaded). Sometimes it even freezes before the Windows login screen.

-------------------------------------------------------------------------------
hw.vmm.topology.cores_per_package: 16
hw.vmm.topology.threads_per_core: 1
-------------------------------------------------------------------------------

AMD SVM is not production-ready yet, is it?
I've been working with Anish to narrow down the problem seen on the Ryzen with a FreeBSD guest. We are making (slow) progress on this.

> AMD SVM is not production-ready yet, is it?

It depends on the guest. I've not seen any issues with Linux guests, for example.
(In reply to Peter Grehan from comment #16) cool, thanks...
I have been able to reproduce something like this: FreeBSD 11.1-RC3 host, FreeBSD 11.1-RC3 guest. Host: AMD 9590 (8 cores), 32G RAM. Guest: 4 cores, 4G RAM, running make -j4 buildworld on the guest.
(In reply to dgilbert from comment #18)
Would you be able to try your same test, but with the guest vCPUs pinned? E.g. add the following bhyve parameters:

-p 0:1 -p 1:2 -p 2:3 -p 3:4
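For reference, a pinned invocation could look like the sketch below - the device slots and names here are only placeholders, not your actual config:

# 4 guest vCPUs, each pinned to a dedicated host CPU (1-4)
bhyve -c 4 -m 4G -H -w \
  -p 0:1 -p 1:2 -p 2:3 -p 3:4 \
  -s 0,hostbridge \
  -s 4,ahci-hd,guest.img \
  -s 31,lpc \
  -l com1,stdio \
  -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd \
  fbsd-guest

Pinning each vCPU to its own host CPU helps rule out host-side scheduling as the source of the missed wakeups.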
(In reply to Peter Grehan from comment #19) You asked me this in email on the list. I replied that this didn't seem to have any effect... Ie: it still hung.
(In reply to dgilbert from comment #20)
Sorry, didn't know that was you. There are 2 other things to try here:

- When the guest is hung, on the host issue:

bhyvectl --get-rip --cpu=0 --vm=<your vm name>
bhyvectl --get-rip --cpu=1 --vm=<your vm name>
bhyvectl --get-rip --cpu=2 --vm=<your vm name>
bhyvectl --get-rip --cpu=3 --vm=<your vm name>

You can look at what the resulting RIP values correspond to by restarting the guest, and within the guest running:

kgdb /boot/kernel/kernel
x/i <rip value>

- Run the same test with a 12-current guest. With luck, it will panic and drop into ddb. If it hangs but doesn't panic, force the guest to drop into ddb from the host by issuing:

bhyvectl --inject-nmi --vm=<your vm name>

From within ddb you can issue a backtrace.
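For convenience, the four RIP queries can be scripted - a minimal sketch assuming a 4-vCPU guest with the hypothetical name "fbsd-guest" (adjust name and count to your setup):

#!/bin/sh
# Print the current RIP of each vCPU of a hung guest.
VM=fbsd-guest        # hypothetical VM name - substitute your own
for cpu in 0 1 2 3; do
    bhyvectl --get-rip --cpu=${cpu} --vm=${VM}
done

Running this a few times while the guest is hung shows whether the RIPs stay constant, which would match the lock-spin signature described in comment #9.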
Hey,

Is there a solution, or has there been any progress on debugging this? I'm getting the same issue with a 1100T on Win10 Pro / Win10 Education / Windows Server 2016 Datacenter. I've been trying to set up surveillance software which unfortunately needs more than one core.

Many thanks,
Dom
This exact problem also happens under bhyve in FreeNAS 11.1 when installing pfSense or OPNsense, so this is not limited to Windows guests - perhaps it is easier to debug with FreeBSD-based guests?
Yes, much easier with a FreeBSD(-based) guest. Some config questions - what version of pfSense/OPNsense, how many guest vCPUs, and what's the AMD h/w setup?
(In reply to Peter Grehan from comment #24)
OPNsense-17.7.5-OpenSSL-dvd-amd64.iso
pfSense-CE-2.4.2-RELEASE-amd64.iso
Latest FreeNAS 11.1
2, 4, 8 vCPUs
4, 8 GB vRAM
Threadripper 1950X
MSI X399 GAMING PRO CARBON AC (latest BIOS)
8x 16GB 3200MHz DDR4
3x 512GB NVMe in RAIDZ1 - 40GB ZVOL per guest
Created attachment 189559 [details]
ktr capture of the problem

I am able to reproduce the problem with a FreeBSD guest on Phenom II X6 1090T. The problem seems to be a guest IPI lost by vmm/svm. The attached ktr demonstrates that.
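For anyone who wants to take a similar capture, a rough sketch of one way to do it - this assumes a host kernel rebuilt with KTR support, and that vmm's tracepoints land in the KTR_GEN class (check sys/amd64/vmm/vmm_ktr.h for the class it actually uses):

# Host kernel config additions (then rebuild and reinstall the kernel):
#   options KTR
#   options KTR_ENTRIES=262144
# Enable tracing at runtime (0x1 = KTR_GEN, assuming vmm maps to that class):
sysctl debug.ktr.mask=0x1
# ...reproduce the hang, then dump the trace buffer with CPU and timestamps:
ktrdump -c -t > /tmp/vmm.ktr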
Please see https://reviews.freebsd.org/D13780 for a potential fix.
And an alternative proposal: https://reviews.freebsd.org/D13828
(In reply to Andriy Gapon from comment #27)
Thanks - with that patch (D13780), I am also able to use multiple vCPUs at every stage of Windows operation. When will it go upstream?
(In reply to Nils Beyer from comment #29)
I still cannot decide between D13780 and D13828. I have given both some light testing; both seem to work.
Please check in D13780 - I much prefer that one unless the latter version (D13828) can be shown to have better performance.
(In reply to Andriy Gapon from comment #30)
Well, performance-wise I ran the Cinebench R15 (RC184115DEMO) CPU benchmark under Windows 10 (latest release) with both patch variants - here are the results:

D13780 - CB results: 484, 483, 484
D13828 - CB results: 481, 482, 479

Not much difference. Regarding stability (production-quality-wise) I can't say anything yet.

For giggles, here's the Cinebench info panel's content:
--------------------------------------------------------------------------
Processor:    AMD Ryzen 7 1700 Eight-Core Processor
Cores x GHz:  4 Cores, 4 Threads @ 3.00 GHz
OS:           Windows 8, 64Bit, Professional Edition
GFX Board:    <empty>
--------------------------------------------------------------------------
A commit references this bug:

Author: avg
Date: Wed Jan 31 11:14:26 UTC 2018
New revision: 328622
URL: https://svnweb.freebsd.org/changeset/base/328622

Log:
  vmm/svm: post LAPIC interrupts using event injection, not virtual interrupts

  The virtual interrupt method uses V_IRQ, V_INTR_PRIO, and V_INTR_VECTOR
  fields of VMCB to inject a virtual interrupt into a guest VM. This method
  has many advantages over the direct event injection as it offloads all
  decisions of whether and when the interrupt can be delivered to the guest.
  But with a purely software emulated vAPIC the advantage is also a problem.

  The problem is that the hypervisor does not have any precise control over
  when the interrupt is actually delivered to the guest (or a notification
  about that). Because of that the hypervisor cannot update the interrupt
  vector in IRR and ISR in the same way as real hardware would. The
  hypervisor becomes aware that the interrupt is being serviced only upon
  the first VMEXIT after the interrupt is delivered. This creates a window
  between the actual interrupt delivery and the update of IRR and ISR. That
  means that IRR and ISR might not be correctly set up to the point of the
  end-of-interrupt signal.

  The described deviation has been observed to cause an interrupt loss in
  the following scenario. vCPU0 posts an inter-processor interrupt to vCPU1.
  The interrupt is injected as a virtual interrupt by the hypervisor. The
  interrupt is delivered to a guest and an interrupt handler is invoked. The
  handler performs a requested action and acknowledges the request by
  modifying a global variable. So far, there is no VMEXIT and the hypervisor
  is unaware of the events. Then, vCPU0 notices the acknowledgment and sends
  another IPI with the same vector. The IPI gets collapsed into the previous
  IPI in the IRR of vCPU1. Only after that a VMEXIT of vCPU1 occurs. At that
  time the vector is cleared in the IRR and is set in the ISR. vCPU1 has
  vAPIC state as if the second IPI has never been sent. The scenario is
  impossible on the real hardware because IRR and ISR are updated just
  before the interrupt handler gets started.

  I saw several possibilities of fixing the problem. One is to intercept the
  virtual interrupt delivery to update IRR and ISR at the right moment. The
  other is to deliver the LAPIC interrupts using the event injection, same
  as legacy interrupts. I opted to use the latter approach for several
  reasons. It's equivalent to what VMM/Intel does (in !VMX case). It appears
  to be what VirtualBox and KVM do. The code is already there (to support
  legacy interrupts).

  Another possibility was to use a special intermediate state for a vector
  after it is injected using a virtual interrupt and before it is known
  whether it was accepted or is still pending. That approach was implemented
  in https://reviews.freebsd.org/D13828. That method is more complex and
  does not have any clear advantage.

  Please see sections 15.20 and 15.21.4 of "AMD64 Architecture Programmer's
  Manual Volume 2: System Programming" (publication 24593, revision 3.29)
  for comparison between event injection and virtual interrupt injection.

  PR:             215972
  Reported by:    ajschot@hotmail.com, grehan
  Tested by:      anish, grehan, Nils Beyer <nbe@renzel.net>
  Reviewed by:    anish, grehan
  MFC after:      2 weeks
  Differential Revision: https://reviews.freebsd.org/D13780

Changes:
  head/sys/amd64/vmm/amd/svm.c
(In reply to Nils Beyer from comment #32) Thank you for testing! I've just committed D13780 based on Peter's guidance and your testing.
Thank you very much. Any chance of getting that into 11-STABLE as well?
(In reply to Nils Beyer from comment #35)
Sorry guys; please forget my last comment. Didn't see that MFC note...
A commit references this bug:

Author: avg
Date: Thu Feb 15 17:09:48 UTC 2018
New revision: 329320
URL: https://svnweb.freebsd.org/changeset/base/329320

Log:
  MFC r328622: vmm/svm: post LAPIC interrupts using event injection

  PR:  215972

Changes:
  _U  stable/11/
  stable/11/sys/amd64/vmm/amd/svm.c
A commit references this bug:

Author: avg
Date: Thu Feb 15 17:10:42 UTC 2018
New revision: 329321
URL: https://svnweb.freebsd.org/changeset/base/329321

Log:
  MFC r328622: vmm/svm: post LAPIC interrupts using event injection

  PR:  215972

Changes:
  _U  stable/10/
  stable/10/sys/amd64/vmm/amd/svm.c
It seems I'm still running into this issue, running FreeBSD 12.0-CURRENT as the guest and trying to run make buildworld.

Host: 11.1-RELEASE-p10
Guest: 12.0-CURRENT

Stacktrace
---
spin lock 0xffffffff81d42760 (smp rendezvous) held by 0xfffff800040c0560 (tid 100089) too long
panic: spin lock held too long
cpuid = 3
time = 1525935605
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe000046e570
vpanic() at vpanic+0x18d/frame 0xfffffe000046e5d0
panic() at panic+0x43/frame 0xfffffe000046e630
_mtx_lock_indefinite_check() at _mtx_lock_indefinite_check+0x8c/frame 0xfffffe000046e650
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xd5/frame 0xfffffe000046e6c0
__mtx_lock_spin_flags() at __mtx_lock_spin_flags+0xd8/frame 0xfffffe000046e700
smp_targeted_tlb_shootdown() at smp_targeted_tlb_shootdown+0xd8/frame 0xfffffe000046e780
smp_masked_invlpg_range() at smp_masked_invlpg_range+0x42/frame 0xfffffe000046e7b0
pmap_invalidate_range() at pmap_invalidate_range+0x291/frame 0xfffffe000046e810
pmap_remove_ptes() at pmap_remove_ptes+0xae/frame 0xfffffe000046e870
pmap_remove() at pmap_remove+0x404/frame 0xfffffe000046e8f0
_kmem_unback() at _kmem_unback+0x43/frame 0xfffffe000046e930
kmem_free() at kmem_free+0x37/frame 0xfffffe000046e950
zone_drain_wait() at zone_drain_wait+0x374/frame 0xfffffe000046e9b0
arc_kmem_reap_now() at arc_kmem_reap_now+0xa4/frame 0xfffffe000046e9e0
arc_reclaim_thread() at arc_reclaim_thread+0x2e5/frame 0xfffffe000046ea70
fork_exit() at fork_exit+0x84/frame 0xfffffe000046eab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000046eab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 8 tid 100056 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why

Sysctls
---
hw.vmm.npt.pmap_flags: 507
hw.vmm.svm.num_asids: 32768
hw.vmm.svm.disable_npf_assist: 0
hw.vmm.svm.features: 113919
hw.vmm.svm.vmcb_clean: 959
hw.vmm.vmx.vpid_alloc_failed: 0
hw.vmm.vmx.posted_interrupt_vector: -1
hw.vmm.vmx.cap.posted_interrupts: 0
hw.vmm.vmx.cap.virtual_interrupt_delivery: 0
hw.vmm.vmx.cap.invpcid: 0
hw.vmm.vmx.cap.monitor_trap: 0
hw.vmm.vmx.cap.unrestricted_guest: 0
hw.vmm.vmx.cap.pause_exit: 0
hw.vmm.vmx.cap.halt_exit: 0
hw.vmm.vmx.initialized: 0
hw.vmm.vmx.cr4_zeros_mask: 0
hw.vmm.vmx.cr4_ones_mask: 0
hw.vmm.vmx.cr0_zeros_mask: 0
hw.vmm.vmx.cr0_ones_mask: 0
hw.vmm.ept.pmap_flags: 0
hw.vmm.vrtc.flag_broken_time: 1
hw.vmm.ppt.devices: 0
hw.vmm.iommu.enable: 1
hw.vmm.iommu.initialized: 0
hw.vmm.bhyve_xcpuids: 8346
hw.vmm.topology.cpuid_leaf_b: 1
hw.vmm.topology.cores_per_package: 2
hw.vmm.topology.threads_per_core: 1
hw.vmm.create: beavis
hw.vmm.destroy: beavis
hw.vmm.trace_guest_exceptions: 0
hw.vmm.ipinum: 251
hw.vmm.halt_detection: 1

Bhyve options (running bhyve using https://github.com/churchers/vm-bhyve as a frontend; if need be, I can see if I can get it to spit out the full command rather than just the options passed)
---
May 09 20:05:33: [bhyve options: -c 4 -m 6G -AHPw -U 84b02223-f0d7-11e7-a8e5-1c1b0de910d7]
May 09 20:05:33: [bhyve devices: -s 0,hostbridge -s 31,lpc -s 4:0,virtio-blk,/bhyve/fbsd-current/disk0.img -s 5:0,virtio-net,tap0,mac=58:9c:fc:0b:23:f9]
May 09 20:05:33: [bhyve console: -l com1,stdio]

CPU info
---
hw.model: AMD Ryzen 7 1700 Eight-Core Processor
hw.machine: amd64
hw.ncpu: 16

My FreeBSD 12.0-CURRENT guest is the only one I have problems with so far (I also have a Linux guest and another BSD guest, but neither has done anything CPU-intensive).
Can you provide the host's 11.1 change number? Andriy's fix (r328622) is in 11-stable (https://svnweb.freebsd.org/base/stable/11/sys/amd64/vmm/amd/svm.c?view=log); just want to confirm.
Sorry, I didn't realize this was still only on the STABLE branch. As my host is currently on the RELEASE branch, I probably won't get the patch until 11.2.
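If I get impatient before 11.2, one option might be to apply just that change to the release sources and rebuild the vmm module - a rough, untested sketch (revision number from the commit in comment #33; the patch may need tweaking to apply cleanly to 11.1 sources):

# Fetch r328622 as a patch and apply it to the installed sources
svnlite diff -c 328622 https://svn.freebsd.org/base/head > /tmp/svm-fix.diff
cd /usr/src && patch -p0 < /tmp/svm-fix.diff
# Rebuild and reinstall only the vmm module
cd /usr/src/sys/modules/vmm && make && make install
# Reload it (all bhyve guests must be stopped first)
kldunload vmm && kldload vmm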
*** Bug 215377 has been marked as a duplicate of this bug. ***
(In reply to Adam Jimerson from comment #41)
Should we try to push an EN (Errata Notice) for this issue?