Bug 237636 - bhyve guest ZFS filesystem freezes in zcw->zcw_cv state
Summary: bhyve guest ZFS filesystem freezes in zcw->zcw_cv state
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-RELEASE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-virtualization (Nobody)
Reported: 2019-04-29 05:18 UTC by Morgan Davis
Modified: 2019-06-01 05:15 UTC
CC List: 6 users

Description Morgan Davis 2019-04-29 05:18:48 UTC
I experience random freezes during file I/O that lock up the entire virtual machine (except networking). The environment is a FreeBSD 12.0-RELEASE bhyve guest running ZFS on a FreeBSD 12.0-RELEASE host that is also on ZFS.

During some specific disk operations, such as a sqlite database vacuum or a git 'gc' operation, the VM freezes and stops performing disk operations.  The VM can still respond to pings and accept network connections, but any disk operation blocks indefinitely.  The only remedy is to kill -9 the bhyve process on the host and restart the guest.

Trying Control-T during one of these episodes always shows the kernel stuck in zcw->zcw_cv.  For example, this "git gc" operation:

Enumerating objects: 12282, done.
Counting objects: 100% (12282/12282), done.
Delta compression using up to 4 threads
Compressing objects: 100% (12259/12259), done.
<--- system completely stops responding here, control-T entered -->
load: 0.47  cmd: git 90093 [zcw->zcw_cv] 31.31r 9.77u 0.09s 0% 123612k
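
The Control-T (SIGINFO) status line above packs several fields into one row. As a rough aid for comparing such lines across incidents, here is a minimal Python sketch that splits one into named fields. The function name `parse_siginfo` and the field names are illustrative labels, not anything FreeBSD defines. The pattern is inferred only from the sample in this report, and reading the trailing fields as real, user, and system time, CPU percent, and resident memory in kilobytes is an assumption based on their suffixes.

```python
import re

# Pattern inferred from the sample Control-T line in this report; the group
# names are illustrative labels, not official FreeBSD field names.
SIGINFO_RE = re.compile(
    r"load:\s+(?P<load>\d+\.\d+)\s+"
    r"cmd:\s+(?P<cmd>\S+)\s+(?P<pid>\d+)\s+"
    r"\[(?P<state>[^\]]+)\]\s+"
    r"(?P<real_s>\d+\.\d+)r\s+(?P<user_s>\d+\.\d+)u\s+(?P<sys_s>\d+\.\d+)s\s+"
    r"(?P<cpu_pct>\d+)%\s+(?P<rss_kb>\d+)k"
)


def parse_siginfo(line):
    """Return a dict of fields from a Control-T status line, or None if it does not match."""
    match = SIGINFO_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "load": float(fields["load"]),
        "cmd": fields["cmd"],
        "pid": int(fields["pid"]),
        "state": fields["state"],  # text in brackets, e.g. what the process is waiting on
        "real_s": float(fields["real_s"]),
        "user_s": float(fields["user_s"]),
        "sys_s": float(fields["sys_s"]),
        "cpu_pct": int(fields["cpu_pct"]),
        "rss_kb": int(fields["rss_kb"]),
    }


sample = "load: 0.47  cmd: git 90093 [zcw->zcw_cv] 31.31r 9.77u 0.09s 0% 123612k"
info = parse_siginfo(sample)
print(info["cmd"], info["pid"], info["state"])  # prints: git 90093 zcw->zcw_cv
```

In the sample, the real time (31.31) far exceeds user plus system time (9.86), which is consistent with a process that is blocked rather than busy.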

This does not always happen and is not reproducible at will. However, the freezes are occurring mostly with sqlite and occasionally with git.

Host system has 16GB RAM and guest has 5GB allocated.  The host does nothing but run this one guest bhyve instance.

Guest's /boot/loader.conf disables ZFS prefetching:

vfs.zfs.prefetch_disable="1"

After much searching for help with the freezes, I found this article, which suggested reducing the ZFS ARC maximum size (vfs.zfs.arc_max): https://www.reddit.com/r/freebsd/comments/1u358s/freebsd_random_freeze_where_to_start/

So I added this to loader.conf, too:

vfs.zfs.arc_max="1G"

This occurs on FreeBSD 11 and now 12 as well.  I don't recall having this issue under FreeBSD 10.
Comment 1 Morgan Davis 2019-05-02 21:04:57 UTC
Today, this happened again during an rsync.  I mention this so that concern is not misdirected at sqlite or git.  

It seems that it occurs during relatively intensive file operations.  I suppose that if I were doing something else, like compressing a batch of files with any archiver/compressor, I might see the same thing happen.

I have equivalent ZFS-backed servers doing the same work, if not more intensive file I/O, that have never had this problem. But they are running on either dedicated hardware (not virtualized) or are running on a non-bhyve VM architecture (e.g., a VPS, AWS, etc.).

I'm convinced the issue is related to bhyve + ZFS (host and guest).
Comment 2 Julien Cigar 2019-05-03 12:44:36 UTC
It might be good to provide the output of procstat -kk -a | grep zfs when the freeze occurs. (Maybe this is the same problem as bug #236220.)
Comment 3 Morgan Davis 2019-05-03 18:31:57 UTC
The aftereffects of the issue affecting my development bhyve system sound very similar to those in bug #236220, and maybe they are related.

However, this issue doesn't affect 9 other FreeBSD servers w/ ZFS that I run with a similarly outfitted FreeBSD 12.0-RELEASE-p3, and many are far more heavily burdened with work in smaller RAM space. The only key difference is that this deadlock occurs only on my very lightly loaded bhyve guest server.

By comparison, my other systems are far more abused. One is a 2GB instance at DigitalOcean that acts as a data backup repository. It runs MySQL as a replication server and is also the endpoint for lots of rsync pushes. It's constantly churning disk I/O with 13% swap in use (due to MySQL). In addition, it has a cron job that regularly cycles ZFS snapshots across 33 datasets for historical archiving. Never once has it, or any of the others I manage, frozen up like my bhyve guest. The only time I need to reboot it is for FreeBSD revision updates or for reconfigurations that I want to test to make sure a restart occurs properly. Right now, it has 28 days uptime.

Given this success everywhere else except for this one bhyve guest instance, I'm not sure my issue is related to the same one affecting those in bug #236220 which 

I will try to issue the procstat command the next time this happens, but I have found that trying to run any command from an open bash prompt usually just stalls as well. I'll omit the pipe to grep to eliminate another possible deadlock trigger.
Comment 4 Morgan Davis 2019-05-08 21:44:32 UTC
This happened again today while running rsync.  But this time a Control-T reports a different state: zio->io_cv

load: 0.04  cmd: rsync 98579 [zio->io_cv] 89.97r 0.04u 0.09s 0% 8108k

No luck getting procstat to run: the shell just locks up when I issue the command, so it is highly unlikely that running it is possible once the system gets into this state.

If there is a way to capture this when it happens, please advise.
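
One possible approach, offered only as an untested sketch: because networking keeps working during a freeze while disk I/O does not, a watcher started before the freeze could run the procstat command suggested in comment 2 at intervals and send the output to another machine instead of writing it to disk. The function names, the 30-second interval, and the address 192.0.2.10:9999 are hypothetical. Whether a new procstat process can even be started once the guest is frozen is an open assumption, so the sketch records a marker instead of hanging when the command cannot start or does not finish.

```python
import socket
import subprocess
import time


def capture(cmd, timeout=10.0):
    """Run cmd and return its combined output as bytes, or a marker line on failure."""
    try:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    except OSError as exc:
        return ("<could not start %s: %s>\n" % (cmd[0], exc)).encode()
    try:
        out, _ = proc.communicate(timeout=timeout)
        return out
    except subprocess.TimeoutExpired:
        # Deliberately no wait() here: a child stuck in a disk wait may never exit.
        proc.kill()
        return ("<%s timed out after %ss>\n" % (cmd[0], timeout)).encode()


def ship(data, host, port):
    """Send data to a TCP listener on another machine; return False if that fails."""
    try:
        with socket.create_connection((host, port), timeout=5.0) as conn:
            conn.sendall(data)
        return True
    except OSError:
        return False


def watch(host, port, interval=30.0, cmd=("procstat", "-kk", "-a")):
    """Loop forever, shipping a timestamped snapshot of cmd's output each interval."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S UTC\n", time.gmtime()).encode()
        ship(stamp + capture(list(cmd)), host, port)
        time.sleep(interval)


# Hypothetical usage, started while the guest is still healthy:
#   watch("192.0.2.10", 9999)
```

The receiving side only needs something that accepts TCP connections and appends what it reads to a file. The last snapshot received before the freeze, plus any timeout markers after it, would at least bracket when the stall began.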
Comment 5 Morgan Davis 2019-05-31 20:36:41 UTC
New instance today with different output from control-T:

load: 0.45  cmd: node 95591 [kqread] 104.57r 1.02u 0.15s 0% 134408k
make: Working in: /usr/local/dtl/src/client/jeromes.com
make[1]: Working in: /usr/local/dtl/src/client/jeromes.com/ts


This occurred in a make process running the typescript compiler (node).

I've had several freeze-ups prior to this but they were similar to previous ones I reported, so I didn't bother to document those.