Bug 231117 - bhyve: I/O lockups inside guests
Summary: bhyve: I/O lockups inside guests
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-virtualization mailing list
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2018-09-03 11:17 UTC by Niels Kristensen
Modified: 2019-07-26 21:47 UTC

See Also:


Description Niels Kristensen 2018-09-03 11:17:09 UTC
We're running bhyve VMs on ~10-20 hypervisors.

Since upgrading everything to 11.2-RELEASE last week, we've seen I/O lockups inside different VMs on several of these hypervisors. The VMs also run FreeBSD 11.2, and both the hypervisors and the VMs use ZFS for backup and quota reasons.

We haven't had any issues on 11.1, but when we ran 11.0 we had some problems with the ARC inside the VMs. A VM would become unable to do any disk I/O, much like what we see now, but on 11.0 the problem could be "resolved" by launching a memory-hungry binary from tmpfs, which forced the ARC to release some memory and restored working I/O.

On 11.0 the affected VMs were generally I/O-intensive ones. Now we see random lockups across different workloads.

The zvol timestamps on the hypervisors are not updated, so the writes clearly fail to go through.

Does anyone have any advice for debugging this issue further?
Comment 1 Niels Kristensen 2018-09-07 05:56:27 UTC
We've tried downgrading a couple of VMs to 11.1, which seems to "resolve" the issue.
Comment 2 punkt.de Hosting Team 2018-09-07 06:59:26 UTC
Hi Niels,

Being more of an operator than a developer, I don't have any insight into the root cause of this, but we run a ton of jails and a couple of bhyve VMs on FreeBSD in production, and we have found that:

* The automatic adjustment of ARC size under memory pressure is still not working perfectly.
* It helps to statically limit the ARC size (vfs.zfs.arc_max) on the host system; we use a quarter of physical memory, so VMs have some breathing room.
* I would definitely limit the ARC size inside each VM as well.
* Are you running the virtual disks as AHCI or virtio? We use virtio throughout.
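
For reference, a minimal sketch of that static host-side limit (hw.physmem, vfs.zfs.arc_max, and /boot/loader.conf are standard FreeBSD; the quarter-of-RAM figure is just the rule of thumb above, and the 64 GiB fallback is only illustrative for running the arithmetic elsewhere):

```shell
# Cap the host ARC at one quarter of physical RAM (rule of thumb above).
# Fall back to an illustrative 64 GiB if hw.physmem is unavailable.
physmem=$(sysctl -n hw.physmem 2>/dev/null || echo 68719476736)
arc_max=$((physmem / 4))
printf 'vfs.zfs.arc_max="%s"\n' "$arc_max"
# Append the printed line to /boot/loader.conf and reboot to apply it.
```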

HTH
Patrick
Comment 3 Michael Gmelin freebsd_committer 2018-09-07 16:11:25 UTC
Since this seems to happen on many VMs, it might be bug #229614, which also occurs on AWS, so it's probably not caused by the hypervisor.

Does it happen during nightly jobs? Does raising kern.maxvnodes make it go away or occur less frequently?
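
For anyone following along, a hedged sketch of that vnode check (vfs.numvnodes and kern.maxvnodes are standard FreeBSD sysctls; the doubling and the fallback number are purely illustrative, and the right value is workload-dependent):

```shell
# How close is the system to its vnode cap?
sysctl vfs.numvnodes kern.maxvnodes 2>/dev/null || true
# Illustrative bump: double the current cap (fallback if the OID is absent).
cur=$(sysctl -n kern.maxvnodes 2>/dev/null || echo 500000)
echo "candidate: kern.maxvnodes=$((cur * 2))"
# Apply with sysctl kern.maxvnodes=<value>, or persist it in /etc/sysctl.conf.
```
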
Comment 4 Michael Gmelin freebsd_committer 2018-09-07 16:12:39 UTC
(In reply to Michael Gmelin from comment #3)

See #229614
Comment 5 Kristian K. Christensen 2018-09-10 16:36:16 UTC
Hi Michael,

Thanks for the update. I'll look into #229614 and try to establish a test environment. We've been forced to downgrade in order to keep production running.

It does seem that downgrading the VM, not the hypervisor, solves the issue. From my brief scan of the other report, it could at least be the same issue.

We downgraded all affected VMs to the newest 11.1-RELEASE.
Comment 6 Kristian K. Christensen 2018-10-07 10:08:12 UTC
So, after a bit of testing, regrettably a fair amount of it in production:

It would seem that boosting the vnode limit helped initially, though eventually it didn't matter. Some systems run without problems; it seems the more memory, the more stable, though I have no conclusive evidence of this. Others seem to die increasingly frequently over time, eventually getting to the point where the system is unresponsive about 30 minutes after the last forced reboot. Based on this, I have a hard time figuring out whether it's related to the other issue. As I am unable to log in once the condition occurs, getting any sort of debug output is tricky as well.

Downgrading to 11.1 fixes the issue. I'll stay there for a bit and wait for #229614 to get resolved and merged to 11.2, after which I'll give it another go.
Comment 7 commit-hook freebsd_committer 2018-12-11 19:35:35 UTC
A commit references this bug:

Author: allanjude
Date: Tue Dec 11 19:34:25 UTC 2018
New revision: 341828
URL: https://svnweb.freebsd.org/changeset/base/341828

Log:
  MFC: r339289: Resolve a hang in ZFS during vnode reclaimation

    This is caused by a deadlock between zil_commit() and zfs_zget()
    Add a way for zfs_zget() to break out of the retry loop in the common case

  PR:		229614, 231117
  Reported by:	grembo, jhb, Andreas Sommer, others
  Relnotes:	yes
  Sponsored by:	Klara Systems

Changes:
_U  stable/11/
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
Comment 8 Kristian K. Christensen 2018-12-11 20:39:36 UTC
(In reply to commit-hook from comment #7)

Is this already in 12.0-RELEASE?

Sadly, we tested 12.0-RELEASE today with no success. We still get a complete I/O lockup after a few hours.
Comment 9 commit-hook freebsd_committer 2018-12-19 18:06:21 UTC
A commit references this bug:

Author: emaste
Date: Wed Dec 19 18:05:50 UTC 2018
New revision: 342226
URL: https://svnweb.freebsd.org/changeset/base/342226

Log:
  MFS11 r341828: Resolve a hang in ZFS during vnode reclaimation

    This is caused by a deadlock between zil_commit() and zfs_zget()
    Add a way for zfs_zget() to break out of the retry loop in the common case

  PR:		229614, 231117
  Submitted by:	allanjude
  Approved by:	so
  Security:	FreeBSD-EN-18:18.zfs
  Sponsored by:	Klara Systems, The FreeBSD Foundation

Changes:
_U  releng/11.2/
  releng/11.2/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
Comment 10 Allan Jude freebsd_committer 2018-12-19 18:36:49 UTC
(In reply to Kristian K. Christensen from comment #8)
Can you provide the `procstat -kk <pid>` output of the processes that hang?
Comment 11 Kristian K. Christensen 2018-12-19 19:36:29 UTC
(In reply to Allan Jude from comment #10)

I'll give it a shot, but expect it to take a little while to get the output.
Let me quickly explain:
Currently, the only things that stop doing I/O are our bhyve instances running anything 11.2 or newer (11.1 rarely has issues, though it *does* happen; see Niels' comment on how it differs). When it happens, the entire instance stops doing I/O; the zvol simply receives no updates from then on.

Regrettably, on 11.2+ it's mostly on production VMs that we can reproduce it often enough to debug, so we tend to roll them back after testing. I can reproduce it on test VMs, but they hang much less frequently, so it could be a long wait. Furthermore, ALL writes stop, and any process that wants to write to anything stalls. This easily gives me a PID to debug with, but very little ability to run commands on the VM. The only fix is to kill the bhyve PID and start it up again. We can run 11.2+ on the hypervisor; it's only running 11.2+ inside the VM that causes the hang. VMs running UFS do not hang. I'm going to leave a couple of sessions logged in and run procstat and ps fairly often to keep them in memory, but I suspect it may take me a little while to get the required output.

I'll get back to you. I'm going to assume zfskern is the most interesting PID to get output from, but if any random process that can't write will do, let me know.
Comment 12 Kristian K. Christensen 2018-12-22 07:30:15 UTC
(In reply to Kristian K. Christensen from comment #11)

So, I can reproduce the issue on a test VM; regrettably I can't get ps output or run procstat, as they both hang.

On the hypervisor, I get the following from procstat -kk on the bhyve pid:
machiavelli# procstat -kk 35743
  PID    TID COMM                TDNAME              KSTACK                       
35743 100908 bhyve               mevent              mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 kqueue_kevent+0x290 kern_kevent+0xb5 sys_kevent+0xa4 amd64_syscall+0xa38 fast_syscall_common+0x101 
35743 101987 bhyve               vtnet-2:0 tx        mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101 
35743 101988 bhyve               blk-3:0-0           mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101 
35743 101989 bhyve               blk-3:0-1           mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101 
35743 101990 bhyve               blk-3:0-2           mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101 
35743 101991 bhyve               blk-3:0-3           mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101 
35743 101992 bhyve               blk-3:0-4           mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101 
35743 101993 bhyve               blk-3:0-5           mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101 
35743 101994 bhyve               blk-3:0-6           mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101 
35743 101995 bhyve               blk-3:0-7           mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101 
35743 101996 bhyve               vcpu 0              mi_switch+0xe6 sleepq_timedwait+0x2f msleep_spin_sbt+0x161 vm_run+0x7d7 vmmdev_ioctl+0x85f devfs_ioctl_f+0x128 kern_ioctl+0x26d sys_ioctl+0x15c amd64_syscall+0xa38 fast_syscall_common+0x101 

The system isn't completely dead, but anything at all that requires I/O hangs indefinitely. I can type on the console, provided it was logged in beforehand, but trying to tab-complete or run anything that's not already in memory hangs.

I'm going to try running procstat and ps more frequently from cron, hoping to keep them in memory. Any other suggestions on how to provide you with useful input would be welcome.
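
A sketch of that cron approach (the log paths, the one-minute interval, and PID 5 for zfskern are illustrative; adjust for the actual guest):

```shell
# Illustrative /etc/crontab entries: run ps and procstat every minute so
# their binaries stay cached and we capture state from just before a hang.
#
#   *  *  *  *  *  root  ps auxww >> /var/log/hang-ps.log 2>&1
#   *  *  *  *  *  root  procstat -kk 5 >> /var/log/hang-zfskern.log 2>&1
```
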
Comment 13 Kristian K. Christensen 2018-12-22 12:15:35 UTC
(In reply to Kristian K. Christensen from comment #12)

Success: procstat output below. Is there anything else that would be helpful?

zfskern:

  PID    TID COMM                TDNAME              KSTACK                       
    5 100034 zfskern             arc_reclaim_thread  mi_switch+0xe6 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x18c arc_reclaim_thread+0x286 fork_exit+0x83 fork_trampoline+0xe 
    5 100035 zfskern             arc_dnlc_evicts_thr mi_switch+0xe6 sleepq_wait+0x2c _cv_wait+0x16e arc_dnlc_evicts_thread+0xff fork_exit+0x83 fork_trampoline+0xe 
    5 100037 zfskern             dbuf_evict_thread   mi_switch+0xe6 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x18c dbuf_evict_thread+0x127 fork_exit+0x83 fork_trampoline+0xe 
    5 100054 zfskern             l2arc_feed_thread   mi_switch+0xe6 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x18c l2arc_feed_thread+0x2e9 fork_exit+0x83 fork_trampoline+0xe 
    5 100302 zfskern             trim zroot          mi_switch+0xe6 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x18c trim_thread+0xab fork_exit+0x83 fork_trampoline+0xe 
    5 100320 zfskern             txg_thread_enter    mi_switch+0xe6 sleepq_wait+0x2c _cv_wait+0x16e txg_quiesce+0x17b txg_quiesce_thread+0x166 fork_exit+0x83 fork_trampoline+0xe 
    5 100321 zfskern             txg_thread_enter    mi_switch+0xe6 sleepq_wait+0x2c _cv_wait+0x16e txg_sync_thread+0x1cb fork_exit+0x83 fork_trampoline+0xe 
    5 100322 zfskern             solthread 0xfffffff mi_switch+0xe6 sleepq_wait+0x2c _cv_wait+0x16e zthr_procedure+0x9b fork_exit+0x83 fork_trampoline+0xe 
    5 100323 zfskern             solthread 0xfffffff mi_switch+0xe6 sleepq_wait+0x2c _cv_wait+0x16e zthr_procedure+0x9b fork_exit+0x83 fork_trampoline+0xe 

pkg upgrade -fy (what got it to lock up this time):
  PID    TID COMM                TDNAME              KSTACK                       
15809 100406 pkg                 -                   mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 kern_wait6+0x4cd sys_wait4+0x78 amd64_syscall+0xa38 fast_syscall_common+0x101 

(the second process):
  PID    TID COMM                TDNAME              KSTACK                       
15810 100344 pkg                 -                   mi_switch+0xe6 sleepq_wait+0x2c _cv_wait+0x16e zil_commit_impl+0xdeb zfs_freebsd_fsync+0xa7 VOP_FSYNC_APV+0x82 kern_fsync+0x1c3 amd64_syscall+0xa38 fast_syscall_common+0x101
Comment 14 Allan Jude freebsd_committer 2018-12-22 16:47:41 UTC
(In reply to Kristian K. Christensen from comment #13)
The arc_reclaim thread suggests you might be running very low on memory. Have you tuned vfs.zfs.arc_max to accommodate the memory needed by the VMs?

However, the fsync/zil_commit_impl() thread looks more like the issue my patch tries to address: #229614.
Comment 15 Kristian K. Christensen 2018-12-22 17:03:06 UTC
(In reply to Allan Jude from comment #14)

Regarding ARC reclaim: this is zfskern INSIDE the VM. It's a base install with roughly a 1 GB arc_max (4 GB RAM). It's currently only running a small Python app I wrote to generate I/O (it writes 4096 files of 1-4 KB, distributes them across directories, and ultimately deletes them), and then I tend to run pkg upgrade -fy with a fairly large package set, which eventually causes it to hang.
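
The Python script itself wasn't posted, so as a stand-in, here is a rough shell equivalent of that workload (the file count, sizes, and directory layout are guesses from the description above, scaled down for a quick run):

```shell
# Write many small (1-4 KB) files spread over a few directories, sync,
# then delete them all: the kind of small-file write load described above.
root=$(mktemp -d)
for d in 0 1 2 3; do mkdir -p "$root/d$d"; done
i=0
while [ "$i" -lt 256 ]; do                      # the original used 4096 files
    size=$(( (i * 37) % 3073 + 1024 ))          # deterministic 1024-4096 bytes
    dd if=/dev/urandom of="$root/d$((i % 4))/f$i" bs="$size" count=1 2>/dev/null
    i=$((i + 1))
done
sync                                            # push the writes out
n=$(find "$root" -type f | wc -l)
echo "wrote $n files, now deleting"
rm -rf "$root"                                  # delete everything again
```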

I'll try moving it to the newest 12.0-STABLE. I assumed your patch was already active in 12.0-RELEASE?
Comment 16 Kristian K. Christensen 2018-12-23 01:15:59 UTC
(In reply to Kristian K. Christensen from comment #15)

So, it seems to trigger a lot quicker on the newest 12.0-STABLE (my second pkg run), which is good news in terms of reproducing, I suppose. The bad news is that it doesn't look much different:

FreeBSD 12.0-RELEASE-p1 r342369 GENERIC  amd64:

pkg (again, pkg is what tripped it up):
  PID    TID COMM                TDNAME              KSTACK                       
 6732 100408 pkg                 -                   mi_switch+0xe1 sleepq_wait+0x2c _cv_wait+0x152 zil_commit_impl+0x94d zfs_freebsd_fsync+0xa7 VOP_FSYNC_APV+0x7e kern_fsync+0x1a0 amd64_syscall+0x369 fast_syscall_common+0x101 


zfskern:
  PID    TID COMM                TDNAME              KSTACK                       
    5 100038 zfskern             arc_reclaim_thread  mi_switch+0xe1 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x17a arc_reclaim_thread+0x146 fork_exit+0x83 fork_trampoline+0xe 
    5 100039 zfskern             arc_dnlc_evicts_thr mi_switch+0xe1 sleepq_wait+0x2c _cv_wait+0x152 arc_dnlc_evicts_thread+0x16f fork_exit+0x83 fork_trampoline+0xe 
    5 100041 zfskern             dbuf_evict_thread   mi_switch+0xe1 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x17a dbuf_evict_thread+0x1c8 fork_exit+0x83 fork_trampoline+0xe 
    5 100058 zfskern             l2arc_feed_thread   mi_switch+0xe1 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x17a l2arc_feed_thread+0x219 fork_exit+0x83 fork_trampoline+0xe 
    5 100307 zfskern             trim zroot          mi_switch+0xe1 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x17a trim_thread+0x11f fork_exit+0x83 fork_trampoline+0xe 
    5 100325 zfskern             txg_thread_enter    mi_switch+0xe1 sleepq_wait+0x2c _cv_wait+0x152 txg_quiesce+0x21b txg_quiesce_thread+0x11b fork_exit+0x83 fork_trampoline+0xe 
    5 100326 zfskern             txg_thread_enter    mi_switch+0xe1 sleepq_wait+0x2c _cv_wait+0x152 txg_sync_thread+0x13b fork_exit+0x83 fork_trampoline+0xe 
    5 100327 zfskern             solthread 0xfffffff mi_switch+0xe1 sleepq_wait+0x2c _cv_wait+0x152 zthr_procedure+0xcc fork_exit+0x83 fork_trampoline+0xe 
    5 100328 zfskern             solthread 0xfffffff mi_switch+0xe1 sleepq_wait+0x2c _cv_wait+0x152 zthr_procedure+0xcc fork_exit+0x83 fork_trampoline+0xe
Comment 17 Kristian K. Christensen 2018-12-28 13:09:58 UTC
So, after some extensive testing over the holidays:

FreeBSD 11.1 with an arc_max as low as 100 MB does not exhibit this behavior.

FreeBSD 11.2 and 12.0 with an arc_max of 1024 MB or lower exhibit this behavior.

FreeBSD 11.2 and 12.0 with an arc_max above 4 GB do not hang. (I need to see how low I can make it go, but so far I've been unable to make a 4 GB arc_max hang.)

It may seem odd to want to limit the ARC, but we have experienced high ARC usage and applications (bhyve VMs and MySQL daemons, notably) being killed due to OOM if we don't enforce ARC limits. This causes a lot of nightly on-call activity and data loss, which is not really something we care for. Furthermore, this is on VMs that use a zvol as primary storage, on a host that has a fairly high ARC limit, so ideally the host handles most of the ARC.
Comment 18 roel 2019-03-14 16:53:29 UTC
Just had this occur again on a VM running under bhyve on 12.0-STABLE, checked out and compiled 6 days ago (r344917). The VM host is running the exact same kernel. The modifications to zfs_znode.c are present, but we still hit the issue after the system had been running for a couple of days.

The VM has an arc_max above 4 GB (so the workaround Kristian described doesn't apply to us):

vfs.zfs.arc_min: 903779840
vfs.zfs.arc_max: 7230238720

Hypervisor:

vfs.zfs.arc_min: 8216929792
vfs.zfs.arc_max: 65735438336

Procstat -kk on the bhyve process:

root@cloud02:/home/roel # procstat -kk 18178
  PID    TID COMM                TDNAME              KSTACK                       
18178 101261 bhyve               mevent              mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a kqueue_kevent+0x297 kern_kevent+0xb5 kern_kevent_generic+0x70 sys_kevent+0x61 amd64_syscall+0x34d fast_syscall_common+0x101 
18178 101731 bhyve               vtnet-2:0 tx        mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101 
18178 101732 bhyve               blk-3:0:0-0         mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101 
18178 101733 bhyve               blk-3:0:0-1         mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101 
18178 101734 bhyve               blk-3:0:0-2         mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101 
18178 101735 bhyve               blk-3:0:0-3         mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101 
18178 101736 bhyve               blk-3:0:0-4         mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101 
18178 101737 bhyve               blk-3:0:0-5         mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101 
18178 101738 bhyve               blk-3:0:0-6         mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101 
18178 101739 bhyve               blk-3:0:0-7         mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101 
18178 101740 bhyve               vcpu 0              <running>                    
18178 101741 bhyve               vcpu 1              mi_switch+0xe2 sleepq_timedwait+0x2f msleep_spin_sbt+0x138 vm_run+0x502 vmmdev_ioctl+0xbed devfs_ioctl+0xad VOP_IOCTL_APV+0x7c vn_ioctl+0x161 devfs_ioctl_f+0x1f kern_ioctl+0x26d sys_ioctl+0x15d amd64_syscall+0x34d fast_syscall_common+0x101 
18178 101742 bhyve               vcpu 2              mi_switch+0xe2 sleepq_timedwait+0x2f msleep_spin_sbt+0x138 vm_run+0x502 vmmdev_ioctl+0xbed devfs_ioctl+0xad VOP_IOCTL_APV+0x7c vn_ioctl+0x161 devfs_ioctl_f+0x1f kern_ioctl+0x26d sys_ioctl+0x15d amd64_syscall+0x34d fast_syscall_common+0x101 
18178 101743 bhyve               vcpu 3              mi_switch+0xe2 sleepq_timedwait+0x2f msleep_spin_sbt+0x138 vm_run+0x502 vmmdev_ioctl+0xbed devfs_ioctl+0xad VOP_IOCTL_APV+0x7c vn_ioctl+0x161 devfs_ioctl_f+0x1f kern_ioctl+0x26d sys_ioctl+0x15d amd64_syscall+0x34d fast_syscall_common+0x101 
18178 101744 bhyve               vcpu 4              mi_switch+0xe2 sleepq_timedwait+0x2f msleep_spin_sbt+0x138 vm_run+0x502 vmmdev_ioctl+0xbed devfs_ioctl+0xad VOP_IOCTL_APV+0x7c vn_ioctl+0x161 devfs_ioctl_f+0x1f kern_ioctl+0x26d sys_ioctl+0x15d amd64_syscall+0x34d fast_syscall_common+0x101 
18178 101745 bhyve               vcpu 5              mi_switch+0xe2 sleepq_timedwait+0x2f msleep_spin_sbt+0x138 vm_run+0x502 vmmdev_ioctl+0xbed devfs_ioctl+0xad VOP_IOCTL_APV+0x7c vn_ioctl+0x161 devfs_ioctl_f+0x1f kern_ioctl+0x26d sys_ioctl+0x15d amd64_syscall+0x34d fast_syscall_common+0x101
Comment 19 Mateusz Kwiatkowski 2019-05-07 08:02:58 UTC
I have a very similar problem to the one described in this issue: I/O in guests hangs. I've experienced this with FreeBSD 11.2 and 12.0 (both with ZFS inside) and Ubuntu 18.04 (ext4) guests.

This started happening after migrating guests from an old hypervisor running 12.0 on a Xeon to a new one running CURRENT (r347183) on an AMD Epyc. On the old hypervisor these VMs ran stable for a couple of months.
On the new hypervisor there are plenty of free resources. Swap is disabled.

Mem: 3761M Active, 1636M Inact, 5802M Wired, 51G Free
ARC: 4000M Total, 487M MFU, 3322M MRU, 3456K Anon, 129M Header, 59M Other
     2228M Compressed, 3202M Uncompressed, 1.44:1 Ratio

vfs.zfs.arc_min: 8215740416
vfs.zfs.arc_max: 52582796492


Procstat from the bhyve process:
root@utgard:~ # procstat -kk 95379
  PID    TID COMM                TDNAME              KSTACK
95379 101075 bhyve               mevent              mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 kqueue_kevent+0xa94 kern_kevent_fp+0x95 kern_kevent+0x9f kern_kevent_generic+0x70 sys_kevent+0x61 amd64_syscall+0x276 fast_syscall_common+0x101
95379 101258 bhyve               blk-4:0-0           mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101259 bhyve               blk-4:0-1           mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101260 bhyve               blk-4:0-2           mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101261 bhyve               blk-4:0-3           mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101262 bhyve               blk-4:0-4           mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101263 bhyve               blk-4:0-5           mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101264 bhyve               blk-4:0-6           mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101265 bhyve               blk-4:0-7           mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101266 bhyve               vtnet-5:0 tx        mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101267 bhyve               vcpu 0              mi_switch+0x174 sleepq_switch+0x110 sleepq_timedwait+0x4f msleep_spin_sbt+0x144 vm_run+0x970 vmmdev_ioctl+0x7ea devfs_ioctl+0xca VOP_IOCTL_APV+0x63 vn_ioctl+0x124 devfs_ioctl_f+0x1f kern_ioctl+0x28a sys_ioctl+0x15d amd64_syscall+0x276 fast_syscall_common+0x101
95379 101268 bhyve               vcpu 1              mi_switch+0x174 sleepq_switch+0x110 sleepq_timedwait+0x4f msleep_spin_sbt+0x144 vm_run+0x970 vmmdev_ioctl+0x7ea devfs_ioctl+0xca VOP_IOCTL_APV+0x63 vn_ioctl+0x124 devfs_ioctl_f+0x1f kern_ioctl+0x28a sys_ioctl+0x15d amd64_syscall+0x276 fast_syscall_common+0x101


I will be happy to provide more information to help solve this issue.
Comment 20 Mateusz Kwiatkowski 2019-05-07 09:32:11 UTC
Output of procstat -kk -a | grep zfs:

# procstat -kk -a | grep zfs
    0 100403 kernel              zfsvfs              mi_switch+0x174 sleepq_switch+0x110 sleepq_wait+0x43 _sleep+0x2da taskqueue_thread_loop+0xe1 fork_exit+0x84 fork_trampoline+0xe
    0 100771 kernel              zfs_vn_rele_taskq   mi_switch+0x174 sleepq_switch+0x110 sleepq_wait+0x43 _sleep+0x2da taskqueue_thread_loop+0xe1 fork_exit+0x84 fork_trampoline+0xe
   46 100365 zfskern             solthread 0xfffffff mi_switch+0x174 sleepq_switch+0x110 sleepq_timedwait+0x4f _cv_timedwait_sbt+0x185 zthr_procedure+0x12c fork_exit+0x84 fork_trampoline+0xe
   46 100366 zfskern             solthread 0xfffffff mi_switch+0x174 sleepq_switch+0x110 sleepq_timedwait+0x4f _cv_timedwait_sbt+0x185 zthr_procedure+0x12c fork_exit+0x84 fork_trampoline+0xe
   46 100367 zfskern             arc_dnlc_evicts_thr mi_switch+0x174 sleepq_switch+0x110 sleepq_wait+0x43 _cv_wait+0x15b arc_dnlc_evicts_thread+0x133 fork_exit+0x84 fork_trampoline+0xe
   46 100369 zfskern             dbuf_evict_thread   mi_switch+0x174 sleepq_switch+0x110 sleepq_timedwait+0x4f _cv_timedwait_sbt+0x185 dbuf_evict_thread+0x296 fork_exit+0x84 fork_trampoline+0xe
   46 100402 zfskern             l2arc_feed_thread   mi_switch+0x174 sleepq_switch+0x110 sleepq_timedwait+0x4f _cv_timedwait_sbt+0x185 l2arc_feed_thread+0x1d5 fork_exit+0x84 fork_trampoline+0xe
   46 100698 zfskern             trim zroot          mi_switch+0x174 sleepq_switch+0x110 sleepq_timedwait+0x4f _cv_timedwait_sbt+0x185 trim_thread+0xab fork_exit+0x84 fork_trampoline+0xe
   46 100812 zfskern             txg_thread_enter    mi_switch+0x174 sleepq_switch+0x110 sleepq_wait+0x43 _cv_wait+0x15b txg_thread_wait+0x9a txg_quiesce_thread+0x2a6 fork_exit+0x84 fork_trampoline+0xe
   46 100813 zfskern             txg_thread_enter    mi_switch+0x174 sleepq_switch+0x110 sleepq_timedwait+0x4f _cv_timedwait_sbt+0x185 txg_thread_wait+0x8d txg_sync_thread+0x53b fork_exit+0x84 fork_trampoline+0xe
   46 100814 zfskern             solthread 0xfffffff mi_switch+0x174 sleepq_switch+0x110 sleepq_wait+0x43 _cv_wait+0x15b zthr_procedure+0xeb fork_exit+0x84 fork_trampoline+0xe
   46 100815 zfskern             solthread 0xfffffff mi_switch+0x174 sleepq_switch+0x110 sleepq_wait+0x43 _cv_wait+0x15b zthr_procedure+0xeb fork_exit+0x84 fork_trampoline+0xe
Comment 21 Rodney W. Grimes freebsd_committer 2019-05-07 12:27:53 UTC
(In reply to Mateusz Kwiatkowski from comment #19)
> On the new hypervisor there's plenty of free resources. Swap is disabled.
> 
> Mem: 3761M Active, 1636M Inact, 5802M Wired, 51G Free
> ARC: 4000M Total, 487M MFU, 3322M MRU, 3456K Anon, 129M Header, 59M Other
>      2228M Compressed, 3202M Uncompressed, 1.44:1 Ratio
>
> vfs.zfs.arc_min: 8215740416
> vfs.zfs.arc_max: 52582796492

How much memory are your VMs using? I.e., what is the total of all the VMs you have running at any time? How much total memory does the system have?

Your arc_max is 52 GB and 4 GB is in use now, so the ARC could try to use another 48 GB; your system has 51 GB free under the above conditions, so the ARC could drive your system down to 3 GB of free memory.

I am concerned your host workload may at times drive your system into memory starvation, unless these numbers were actually taken with the whole system under load.
Comment 22 Mateusz Kwiatkowski 2019-05-07 18:26:50 UTC
(In reply to Rodney W. Grimes from comment #21)
> How much memory are your VMs using? I.e., what is the total of all the VMs you have running at any time? How much total memory does the system have?

The VMs may use 8 GB of RAM in total. The hypervisor has 64 GB in total. So the VMs and ARC could create an OOM condition, but the statistics I provided were taken at the moment one of the VMs hung. I'll reduce arc_max to 8 GB anyway, to make sure this isn't the cause.
Comment 23 Rodney W. Grimes freebsd_committer 2019-05-07 19:50:40 UTC
(In reply to Mateusz Kwiatkowski from comment #22)
> VMs may use 8GBs of RAM in total. Hypervisor has 64GB in total. So VMs and ARC could create OOM condition but the statistics I provided were taken at the moment when one of the VMs hang. I'll reduce arc_max anyway to 8GB to make sure this isn't the cause.

Not sure you need to cut arc_max to 8 GB; just make sure that arc_max + total(all VMs) + host usage is less than 64 GB.
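
As a worked example of that budget (all numbers are illustrative: a 64 GB host, 8 GB of VMs, and 4 GB of host usage):

```shell
# arc_max + total(all VMs) + host usage must stay under physical RAM.
total_mb=$((64 * 1024))
vms_mb=$((8 * 1024))
host_mb=$((4 * 1024))
arc_budget_mb=$(( total_mb - vms_mb - host_mb ))
echo "arc_max should stay below ${arc_budget_mb} MB"
```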

I have been pointed at:
If the guest(s) are using virtio-block, it might be interesting to
check the status of the IO queues for those devices.  The reported
issues could be a case of https://smartos.org/bugview/OS-7613.
(Posted as https://reviews.freebsd.org/D19501)

I have pinged the D19501 review, and it looks like some forward motion has happened there. If you can, you may want to try the patch that is there now, knowing that it is about to be revised.
Comment 24 Mateusz Kwiatkowski 2019-05-09 07:31:01 UTC
(In reply to Rodney W. Grimes from comment #23)

Rodney, many thanks for your engagement in this matter. It seems you were right to suspect that arc_max is the culprit. After reducing its size, the virtual machines have been running stable for over a day. So instead of a stability issue this was more of a user-experience issue. As you can see, we were both tricked by the top(1) output, which looked fine and didn't indicate a problem. Would it be possible to print a notice in dmesg when this happens (as we have for other kernel limits)?
Comment 25 Mateusz Kwiatkowski 2019-05-09 09:58:01 UTC
(In reply to Mateusz Kwiatkowski from comment #24)
I'm afraid I celebrated too soon. I/O in the VM locked up; it just took more time.

Memory stats:
Mem: 3846M Active, 4806M Inact, 26G Wired, 27G Free
ARC: 7755M Total, 2419M MFU, 4208M MRU, 14M Anon, 478M Header, 635M Other
     6145M Compressed, 10G Uncompressed, 1.68:1 Ratio 

I'll now test the patch with the virtio barriers.
Comment 26 Mateusz Kwiatkowski 2019-05-15 13:04:27 UTC
(In reply to Rodney W. Grimes from comment #23)
The VM running the patched bhyve has been stable since Sunday evening. That's three whole days, and it never ran that long without a hang before. I think we can assume the patch did the job!
Comment 27 commit-hook freebsd_committer 2019-05-18 19:32:50 UTC
A commit references this bug:

Author: rgrimes
Date: Sat May 18 19:32:39 UTC 2019
New revision: 347960
URL: https://svnweb.freebsd.org/changeset/base/347960

Log:
  bhyve virtio needs barriers

  Under certain tight race conditions, we found that the lack of a memory
  barrier in bhyve's virtio handling causes it to miss a NO_NOTIFY state
  transition on block devices, resulting in guest stall. The investigation
  is recorded in OS-7613. As part of the examination into bhyve's use of
  barriers, one other section was found to be problematic, but only on
  non-x86 ISAs with less strict memory ordering. That was addressed in
  this patch as well, although it was not at all a problem on x86.

  PR:		231117
  Submitted by:	Patrick Mooney <patrick.mooney@joyent.com>
  Reviewed by:	jhb, kib, rgrimes
  Approved by:	jhb
  MFC after:	3 days
  Differential Revision:	https://reviews.freebsd.org/D19501

Changes:
  head/usr.sbin/bhyve/virtio.c
Comment 28 commit-hook freebsd_committer 2019-05-23 18:49:17 UTC
A commit references this bug:

Author: rgrimes
Date: Thu May 23 18:48:47 UTC 2019
New revision: 348186
URL: https://svnweb.freebsd.org/changeset/base/348186

Log:
  MFC: r347960: bhyve virtio needs barriers

  Under certain tight race conditions, we found that the lack of a memory
  barrier in bhyve's virtio handling causes it to miss a NO_NOTIFY state
  transition on block devices, resulting in guest stall. The investigation
  is recorded in OS-7613. As part of the examination into bhyve's use of
  barriers, one other section was found to be problematic, but only on
  non-x86 ISAs with less strict memory ordering. That was addressed in
  this patch as well, although it was not at all a problem on x86.

  PR:		231117
  Submitted by:	Patrick Mooney <patrick.mooney@joyent.com>
  Reviewed by:	jhb, kib, rgrimes
  Approved by:	jhb
  Differential Revision:	https://reviews.freebsd.org/D19501

Changes:
_U  stable/12/
  stable/12/usr.sbin/bhyve/virtio.c
Comment 29 commit-hook freebsd_committer 2019-05-23 18:58:33 UTC
A commit references this bug:

Author: rgrimes
Date: Thu May 23 18:58:06 UTC 2019
New revision: 348188
URL: https://svnweb.freebsd.org/changeset/base/348188

Log:
  MFC: r347960: bhyve virtio needs barriers

  Under certain tight race conditions, we found that the lack of a memory
  barrier in bhyve's virtio handling causes it to miss a NO_NOTIFY state
  transition on block devices, resulting in guest stall. The investigation
  is recorded in OS-7613. As part of the examination into bhyve's use of
  barriers, one other section was found to be problematic, but only on
  non-x86 ISAs with less strict memory ordering. That was addressed in
  this patch as well, although it was not at all a problem on x86.

  PR:		231117
  Submitted by:	Patrick Mooney <patrick.mooney@joyent.com>
  Reviewed by:	jhb, kib, rgrimes
  Approved by:	re (gjb), jhb
  Differential Revision:	https://reviews.freebsd.org/D19501

Changes:
_U  stable/11/
  stable/11/usr.sbin/bhyve/virtio.c