Bug 222916 - [bhyve] Debian guest kernel panics with message "CPU#0 stuck for Xs!"
Summary: [bhyve] Debian guest kernel panics with message "CPU#0 stuck for Xs!"
Status: Closed Not A Bug
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-virtualization (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-10-10 21:06 UTC by karihre
Modified: 2023-12-05 18:08 UTC
CC: 2 users

See Also:


Attachments
Syslog output of Debian guest leading up to hanging (602.41 KB, application/x-bzip2)
2017-10-10 21:06 UTC, karihre

Description karihre 2017-10-10 21:06:15 UTC
Created attachment 187061 [details]
Syslog output of Debian guest leading up to hanging

Dear all,

I've been having a fair number of crashes on a Debian 9 (kernel 4.9.0) guest running under the bhyve hypervisor on FreeBSD 11.1-RELEASE-p1 (GENERIC). The general pattern is that the guest stays up for 2-6 days before completely hanging, at which point it must be hard-restarted.

This appears to happen under some load (often following the start of cron jobs) and to be preceded by some hours of kernel warnings on the guest side. I find nothing out of the ordinary on the host side; the logs there do not reveal much of anything.

The CPU is an Intel Xeon E3-1275 v6 (Kaby Lake) and the motherboard is a Supermicro MBD-X11SSH-LN4F-O. The bhyve startup command line is:
  bhyve -AHP \
    -s 0:0,hostbridge \
    -s 1:0,lpc \
    -s 2:0,virtio-net,tap0 \
    -s 3:0,virtio-net,tap1 \
    -s 4:0,virtio-blk,/dev/zvol/tank/vms/strokkur-root \
    -s 5:0,virtio-blk,/dev/zvol/tank/vms/strokkur-scratch \
    -s 6:0,virtio-blk,/dev/zvol/tank/vms/strokkur-temp \
    -s 29,fbuf,tcp=127.0.0.1:5900,w=800,h=600 \
    -l com1,stdio \
    -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd \
    -c 4 \
    -m 32G strokkur

A logged-in user will see "NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [apache2:18004]" or a similar message; however, this message is not logged to syslog.
The relevant Debian syslog entries leading up to the most recent crash are attached. In total, each crash produces roughly 600MB of log entries, so I've cut out a lot.

If I can provide any more information please let me know.

Thank you,
Kari Hreinsson
Comment 1 karihre 2017-10-10 21:37:44 UTC
And while I remember: hyperthreading on the host is disabled, sysutils/devcpu-data is installed, and microcode_update_enable="YES" is present in /etc/rc.conf.
Comment 2 Peter Grehan 2017-10-14 01:33:22 UTC
The "CPU stuck" messages generally occurr when a system is oversubscribed i.e. the guest has been given more virtual resources than are physically available.

In this case, the guest has been assigned the same number of vCPUs as the host has cores (assuming Intel ARK is correct here: https://ark.intel.com/products/97478/Intel-Xeon-Processor-E3-1275-v6-8M-Cache-3_80-GHz), so if there is a time when both the host and the guest require all CPUs, there are only 4 physical cores available to service the 8 assumed to exist (4 host, 4 guest).
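
For reference, the CPU count the host scheduler actually has to work with can be checked with the stock hw.ncpu sysctl; with hyperthreading disabled, this box should report 4:

  sysctl hw.ncpu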

In addition, the guest has been assigned 32G of RAM, and ZFS is being used on the host. The issue there is that the default ZFS setup on FreeBSD allows the ARC to use all RAM minus 1GB. This competes with bhyve's use of (swap-backed) RAM and can result in excessive swapping by the bhyve process. That in turn can also produce "CPU stuck" messages, as vCPUs are halted while waiting for guest physical memory to be paged in.

A recommendation would be to
   a) Restrict ZFS ARC usage so it leaves enough memory for the host and the bhyve guests. In this case, perhaps 16-24G (if the mobo has 64G)? This can be done using the vfs.zfs.arc_max parameter in /boot/loader.conf; a sketch follows after this list.
   b) Only use 2 vCPUs for the guest, or enable hyperthreading.
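
A minimal /boot/loader.conf sketch (the 20G cap is just the figure settled on later in this report; size it to your own workload):

  # Cap the ZFS ARC so it leaves room for the host and the bhyve guests.
  # Takes effect at the next boot.
  vfs.zfs.arc_max="20G"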
Comment 3 karihre 2017-10-14 01:52:06 UTC
Thank you for the quick reply. Hyperthreading was initially enabled, with the guest assigned 4 vCPUs, but it was disabled as part of the debugging process due to recent problems with hyperthreading on Intel processors.

The memory in the machine was indeed 64GB, and ZFS ARC usage was not being limited as you suggested. I have now set the limit to 20GB and plan to reboot soon to see if this resolves the issue. I have also (since the last boot of the guest) limited the vCPUs to 1 as part of the debugging process. The system has yet to hang, but in general it survived a few days before running into this issue, so the verdict is still out on that mitigation.

I will post as soon as I get the next crash, or consider this resolved if everything appears stable for the coming weeks. I'm not sure whether you want to mark the bug as closed, or whether it can remain open until something (or nothing) happens.

Thank you,
Kari Hreinsson
Comment 4 Peter Grehan 2017-10-14 02:21:20 UTC
Single-vCPU guests perform very well in oversubscribed environments since there are no lock-spins against other, possibly de-scheduled, vCPUs. However, you may still hit the ZFS memory-competition issue. Note that you can set the ARC max on the fly in 11.1 and avoid the reboot, using the vfs.zfs.arc_max sysctl variable (sketch below). The ARC should slowly drop down to this value if it is above it.
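
A sketch of the on-the-fly version, assuming the 20G figure from comment 3 (this sysctl takes a byte count):

  # 20 GiB = 20 * 1024^3 = 21474836480 bytes
  sysctl vfs.zfs.arc_max=21474836480
  # Watch the ARC drift down toward the new cap:
  sysctl kstat.zfs.misc.arcstats.size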

>not sure if you want to mark the bug as closed

Fine to keep this open until the issue is resolved.
Comment 5 karihre 2017-10-14 06:15:32 UTC
Thanks for that tip. I reduced the ARC size with sysctl and confirmed it to be 20GB with zfs-info.

Thinking back to the 4 guest / 4 host CPUs: let's say the collection of guests consumes 4 CPUs and four tasks on the host consume 4 CPUs (totaling a load average of 8). Does the host system scheduler not shuffle tasks around as it would if I were running 8 CPU-intensive processes on the host? Or does the interaction between bhyve and the host scheduler somehow result in the virtual CPUs being set aside for tens of seconds?

I guess I'm just trying to understand. I would think one of the main motivations for using a hypervisor is precisely oversubscribing CPU cores: you may have guests with "bursty" load behavior, so that on average the total guest+host load is less than the number of CPUs, but surely the CPU time can be divided in a "fair" manner when the system is overloaded.

Memory, I would think, is a little trickier; there it makes sense to ensure that host consumption plus guest consumption never exceeds the total host memory.
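
To put numbers on that with the figures from this report, a rough budget for this machine would be:

  64G total - 32G guest RAM - 20G ARC cap = 12G left for the host kernel, daemons, and bhyve overhead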

Anyhow, I'm just trying to make sense of this. There doesn't seem to be much information available online on these topics, or perhaps I'm looking in all the wrong places.

Thank you,
Kari
Comment 6 Peter Grehan 2017-10-14 17:12:44 UTC
>Or does the interaction between bhyve and the host scheduler somehow
>result in the virtual cpus being set aside

 Yes, though:

> for tens of seconds?

 The error message from Linux is a bit misleading. There is a low-priority kernel thread that tries to run every 5 seconds and then sleeps. If it hasn't been able to run for an extended amount of time (for example due to high interrupt activity, higher-priority threads running, or spinlocks being held), the error message is displayed.
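
 For what it's worth, the threshold is tunable on the guest side. Assuming the stock kernel.watchdog_thresh knob in Linux 4.9, it can be inspected, or raised as a stopgap (this only quiets the warnings; it does not address the underlying stall):

    # The soft-lockup detector fires after roughly 2 * watchdog_thresh seconds.
    sysctl kernel.watchdog_thresh          # default is 10
    sysctl -w kernel.watchdog_thresh=30    # raise the reporting threshold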

 What I believe you are seeing is a classic hypervisor problem, not specific to bhyve, known as "lock-holder preemption": a vCPU holding a spin-lock is preempted by the host, and other running vCPUs then spin attempting to acquire a lock that cannot be released. A search will show the large amount of literature on this issue :)

 Maybe the best reading on this is the ESXi scheduler paper:
    http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/vmware-vsphere-cpu-sched-performance-white-paper.pdf
 
 There has been some talk of putting knowledge of vCPUs in the FreeBSD scheduler to allow some form of gang scheduling, but nothing has come of that so far.

 As to your point: it's more than just fairness that the hypervisor scheduler has to provide; heuristics about guest OS behaviour are also needed.
Comment 7 karihre 2017-11-06 21:59:15 UTC
So the VM has now been running for more than 3 weeks without problems, with 4 vCPUs and 32 GB of memory. I think it is safe to say this was not a bug but, in fact, memory oversubscription. Thank you for your quick comments/advice!

Hopefully this (non-)bug report will shed light on similar problems for others. I'm marking this as closed/not-a-bug; feel free to change the status to something else as appropriate.
Comment 8 Sean McBride 2023-12-05 18:08:13 UTC
I realize this is "Closed Not A Bug", but it sounds a lot like this to me:

https://reviews.freebsd.org/D39620#978525