On FreeBSD 11, and on older versions using AIO, VirtualBox can lose disk access when the disk is heavily loaded. The symptom is either the guest locking up, with the guest's "disk" showing continuous activity while the host shows no disk activity, or a crash in the guest. This can be worked around by adjusting the AIO sysctls to allow added capacity before disk requests are lost. While I have no real fix and this issue has existed for a long time, I suggest that pkg-message carry an added warning. Here is a first cut that could use some word-smithing:

If AIO is enabled (always in FreeBSD 11.0 and newer), guests may lose disk access. This can be mitigated by adjusting the AIO sysctls as follows:

vfs.aio.aiod_lifetime=30000
vfs.aio.max_aio_procs=4
vfs.aio.max_aio_queue=65536
vfs.aio.max_aio_queue_per_proc=65536
vfs.aio.max_aio_per_proc=8192
vfs.aio.max_buf_aio=8192

These lines may be added to /etc/sysctl.conf to make them the defaults.
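For anyone who wants to try this without a reboot, a minimal sketch using stock sysctl(8) as root (same values as above; entries in /etc/sysctl.conf only take effect at the next boot):

# Apply at runtime; these take effect immediately and last until reboot:
sysctl vfs.aio.aiod_lifetime=30000
sysctl vfs.aio.max_aio_procs=4
sysctl vfs.aio.max_aio_queue=65536
sysctl vfs.aio.max_aio_queue_per_proc=65536
sysctl vfs.aio.max_aio_per_proc=8192
sysctl vfs.aio.max_buf_aio=8192

The /etc/sysctl.conf entries then make the same values stick across reboots.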
In-flight AIO requests require kernel resources (threads, kernel memory), so the kernel limits the concurrency. It sounds like VirtualBox tries to issue too many AIO requests and ignores the resulting errors. Just increasing the limits to something that doesn't fail on your system with your workload doesn't fix such a bug; at best (or worst) it hides it. An application is expected to handle temporary AIO errors. Possible strategies include falling back to synchronous I/O, retrying the AIO request with an exponential backoff, or simply reporting the temporary error to the next layer.
The use of large amounts of KVM (kernel virtual memory) by these AIO parameters is a concern to me, and I have been testing with more moderate values to see whether they keep VBox stable. I have now reduced the two biggest changes to:

vfs.aio.max_aio_queue=8192
vfs.aio.max_aio_queue_per_proc=1024

So far, after two days of use, my guest system has remained stable. I should also note that the maximum number of AIO processes is reduced from 32 to 8, which will tend to limit the KVM consumed substantially. I will also be looking at the 10x increase in the AIO daemon lifetime. It is conceivable that this is either critical to working around the problem or unneeded; those more familiar with the inner workings of the AIO daemons might have a better idea than I do. I will continue to adjust parameters, but must do so slowly, as it can take some time before the problem pops up, depending on disk load (whether from VBox or other activity). I am testing on an older (5-year-old) laptop with a 2.5 GHz Sandy Bridge i5 CPU and a standard 1 TB HGST disk, so disk I/O will not be impressive.
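For reference, the whole vfs.aio subtree can be listed with sysctl(8) while experimenting; this shows both the tunable limits and the read-only counters (the exact counters available vary by release):

# Dump every tunable and counter under vfs.aio to see the current limits in one shot:
sysctl vfs.aio

# Spot-check a single tunable while trying a new value:
sysctl vfs.aio.max_aio_queue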
I am now running with a more conservative set of AIO parameters and have had no problems. N.B. I only run a single instance of Windows 7 or, on rare occasion, Linux Mint, so this might not be appropriate for servers running multiple VMs at the same time.

# From my sysctl.conf                   # Default
vfs.aio.max_aio_procs=4                 # (32)
vfs.aio.aiod_lifetime=30000             # (3000)
vfs.aio.max_aio_queue=8192              # (1024)
vfs.aio.max_aio_queue_per_proc=1024     # (256)
vfs.aio.max_aio_per_proc=128            # (32)
vfs.aio.max_buf_aio=64                  # (16)

I will shortly reduce these further and see how it goes. This does not preclude the possibility that modifications to the port could fix this far better, but with these adjustments I have not had an issue with either kernel memory or errors in the VMs.
(In reply to rkoberman from comment #3)
Eleven months on, and it all remains stable. Since these parameters have been stable and I have not had issues with exhausting KVM, I'm just leaving them alone, although I suspect a couple may still be excessive.
The sysctl AIO tweaks improve the stability of VirtualBox running a Windows 7 guest, but heavy I/O still causes a problem for me. For example, I was able to get through the Windows installation after making the sysctl changes, but later, while installing many Windows updates, Windows locked up and eventually a BSOD appeared complaining about possible hardware issues. This is running virtualbox-ose-5.1.26 on 11.1-RELEASE-p1 amd64.

P.S. 30000 is already the default for vfs.aio.aiod_lifetime on all my systems running 11.1.
This sounds like a duplicate of bug 168298, which is also fixed (or at least hidden) by higher AIO resource limits.

*** This bug has been marked as a duplicate of bug 168298 ***