We're running bhyve VMs on roughly 10-20 hypervisors. Since upgrading everything to 11.2-RELEASE last week, we've seen I/O lockups inside different VMs on several of these hypervisors. The VMs also run FreeBSD 11.2, and both the hypervisors and the VMs use ZFS for backup and quota reasons. We haven't had any issues on 11.1, but when we ran 11.0 we had some problems with the ARC inside the VMs. The VM would be unable to perform any disk I/O, as we experience now as well, but on 11.0 the problem could be "resolved" by launching a memory-hungry binary from tmpfs, causing the ARC to release some memory, after which I/O worked again. On 11.0 the affected VMs were generally I/O-intensive VMs; now we see random lockups on different workloads. The zvol timestamps on the hypervisors are not updated, so clearly writes fail to go through. Does anyone have some advice for debugging this issue further?
We've tried to downgrade a couple of VMs to 11.1. It seems to "resolve" the issue.
Hi Niels, being more of an operator than a developer I don't have any insight into the root cause of this, but we run a ton of jails and a couple of bhyve VMs on FreeBSD in production and we found that:

* The automatic adjustment of ARC size under memory pressure is still not working perfectly.
* It helps to statically limit the ARC size (vfs.zfs.arc_max) on the host system - we use a quarter of physical memory, so VMs have some breathing room.
* I would definitely limit the ARC size inside each VM.
* Are you running the virtual disks as AHCI or virtio? We use virtio throughout.

HTH
Patrick
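For reference, a minimal sketch of the static host-side limit Patrick describes, assuming a 64GB hypervisor (the machine size and the resulting value are illustrative, not taken from this thread). vfs.zfs.arc_max is set as a loader tunable and takes effect at the next boot:

  # /boot/loader.conf on the hypervisor
  # cap the ARC at a quarter of 64GB of physical memory (value in bytes, 16 GiB)
  vfs.zfs.arc_max="17179869184"

The same tunable can be set inside each guest, with a much smaller value, to keep the VM's own ARC bounded as suggested in the third point above.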
As this seems to happen on many VMs, it might be this issue, #229614, which happens on AWS - so it's probably not caused by the hypervisor. Does it happen during nightly jobs? Does raising kern.maxvnodes make it go away or happen less frequently?
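For anyone trying the kern.maxvnodes suggestion, a hedged sketch of checking and raising it; the value is an arbitrary example, not a recommendation from this report:

  # inspect the current limit and how many vnodes are actually in use
  sysctl kern.maxvnodes vfs.numvnodes
  # raise the limit at runtime
  sysctl kern.maxvnodes=2000000
  # make it persistent across reboots
  echo 'kern.maxvnodes=2000000' >> /etc/sysctl.conf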
(In reply to Michael Gmelin from comment #3) See #229614
Hi Michael, thanks for the update, I'll look into #229614 and try to establish a test environment. We've been forced to downgrade in order to keep production running. It does seem that downgrading the VM, not the hypervisor, solves the issue. From my brief scan of the other issue, it does at least appear that it could be the same problem. We downgraded all affected VMs to the newest 11.1-RELEASE.
So, after a bit of testing, regrettably a fair amount of it in production: it would seem that boosting vnodes helped initially, though eventually it didn't matter. Some systems run without problems; it seems the more memory, the more stable, though I have no conclusive evidence of this. Others seem to die increasingly frequently over time, eventually getting to the point where the system is unresponsive about 30 minutes after the last forced reboot. Based on this, I have a hard time figuring out whether it's related to the other issue. As I am unable to log in once the condition occurs, getting any sort of debug output is tricky as well. Downgrading to 11.1 fixes the issue. I'll stay here for a bit and wait for #229614 to get resolved and ported to 11.2, after which I'll give it another go.
A commit references this bug:

Author: allanjude
Date: Tue Dec 11 19:34:25 UTC 2018
New revision: 341828
URL: https://svnweb.freebsd.org/changeset/base/341828

Log:
  MFC: r339289:

  Resolve a hang in ZFS during vnode reclaimation

  This is caused by a deadlock between zil_commit() and zfs_zget()

  Add a way for zfs_zget() to break out of the retry loop in the common case

  PR:           229614, 231117
  Reported by:  grembo, jhb, Andreas Sommer, others
  Relnotes:     yes
  Sponsored by: Klara Systems

Changes:
  _U  stable/11/
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
(In reply to commit-hook from comment #7) Is this already in 12.0-RELEASE? Sadly, we tested 12.0-RELEASE today with no success. We still get a complete I/O lockup after a few hours.
A commit references this bug:

Author: emaste
Date: Wed Dec 19 18:05:50 UTC 2018
New revision: 342226
URL: https://svnweb.freebsd.org/changeset/base/342226

Log:
  MFS11 r341828:

  Resolve a hang in ZFS during vnode reclaimation

  This is caused by a deadlock between zil_commit() and zfs_zget()

  Add a way for zfs_zget() to break out of the retry loop in the common case

  PR:           229614, 231117
  Submitted by: allanjude
  Approved by:  so
  Security:     FreeBSD-EN-18:18.zfs
  Sponsored by: Klara Systems, The FreeBSD Foundation

Changes:
  _U  releng/11.2/
  releng/11.2/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
(In reply to Kristian K. Christensen from comment #8) Can you provide the `procstat -kk <pid>` output of the processes that hang?
(In reply to Allan Jude from comment #10) I'll give it a shot, but expect it to take a little while to get the output. Let me quickly explain: currently, the only thing that stops doing I/O is our bhyve instances running anything 11.2 or newer (11.1 rarely ever has issues, though it *does* happen - see Niels' comment on how it differs). However, it's the entire instance that stops doing I/O; the zvol simply receives no updates from that point on.

Regrettably, on 11.2+ it's mostly on production VMs that we can reproduce it often enough to debug, so we tend to roll them back after testing. I can reproduce it on test VMs, but they crash more infrequently and thus it could be a long wait. Furthermore, ALL writes stop and any process that wants to write to anything stalls. This easily gives me a PID to debug with, but very little access to run commands on the VM. The only fix is to kill the bhyve PID and start it up again. We can run 11.2+ on the hypervisor; it's only running 11.2+ inside the VM that causes the hang. VMs running UFS do not hang.

I'm going to leave a couple of sessions logged in and run procstat and ps fairly often to keep them in memory, but I suspect it may take me a little bit to get the required output. I'll get back to you; I'm going to assume zfskern is the most interesting PID to get output from, but if any random process that can't write will do, let me know.
(In reply to Kristian K. Christensen from comment #11) So, I can reproduce the issue on a test VM; regrettably I can't get ps output or run procstat - they both hang. On the hypervisor, I get the following from procstat -kk on the bhyve PID:

machiavelli# procstat -kk 35743
PID TID COMM TDNAME KSTACK
35743 100908 bhyve mevent mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 kqueue_kevent+0x290 kern_kevent+0xb5 sys_kevent+0xa4 amd64_syscall+0xa38 fast_syscall_common+0x101
35743 101987 bhyve vtnet-2:0 tx mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101
35743 101988 bhyve blk-3:0-0 mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101
35743 101989 bhyve blk-3:0-1 mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101
35743 101990 bhyve blk-3:0-2 mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101
35743 101991 bhyve blk-3:0-3 mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101
35743 101992 bhyve blk-3:0-4 mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101
35743 101993 bhyve blk-3:0-5 mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101
35743 101994 bhyve blk-3:0-6 mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101
35743 101995 bhyve blk-3:0-7 mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0xa38 fast_syscall_common+0x101
35743 101996 bhyve vcpu 0 mi_switch+0xe6 sleepq_timedwait+0x2f msleep_spin_sbt+0x161 vm_run+0x7d7 vmmdev_ioctl+0x85f devfs_ioctl_f+0x128 kern_ioctl+0x26d sys_ioctl+0x15c amd64_syscall+0xa38 fast_syscall_common+0x101

The system isn't completely dead, but anything at all that requires I/O hangs indefinitely. I can type on the console, provided it was logged in beforehand, but trying to tab-complete or run anything that's not in memory hangs. I'm going to try to run procstat and ps more frequently from cron, hoping to keep them in memory. Any other suggestions as to how to provide you with useful input would be welcome.
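A rough sketch of what "run procstat and ps more frequently from cron" could look like as root cron entries inside the VM (the log paths and the one-minute interval are assumptions for illustration, not taken from this report):

  # crontab -e (root, inside the VM)
  * * * * * /usr/bin/procstat -kk -a >> /var/log/procstat-kk.log 2>&1
  * * * * * /bin/ps auxww >> /var/log/ps.log 2>&1

The periodic runs keep the binaries and their libraries cached in memory and leave a trail of output from just before the hang.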
(In reply to Kristian K. Christensen from comment #12) So, success - procstat output below. Is there anything else that would be helpful?

zfskern:
PID TID COMM TDNAME KSTACK
5 100034 zfskern arc_reclaim_thread mi_switch+0xe6 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x18c arc_reclaim_thread+0x286 fork_exit+0x83 fork_trampoline+0xe
5 100035 zfskern arc_dnlc_evicts_thr mi_switch+0xe6 sleepq_wait+0x2c _cv_wait+0x16e arc_dnlc_evicts_thread+0xff fork_exit+0x83 fork_trampoline+0xe
5 100037 zfskern dbuf_evict_thread mi_switch+0xe6 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x18c dbuf_evict_thread+0x127 fork_exit+0x83 fork_trampoline+0xe
5 100054 zfskern l2arc_feed_thread mi_switch+0xe6 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x18c l2arc_feed_thread+0x2e9 fork_exit+0x83 fork_trampoline+0xe
5 100302 zfskern trim zroot mi_switch+0xe6 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x18c trim_thread+0xab fork_exit+0x83 fork_trampoline+0xe
5 100320 zfskern txg_thread_enter mi_switch+0xe6 sleepq_wait+0x2c _cv_wait+0x16e txg_quiesce+0x17b txg_quiesce_thread+0x166 fork_exit+0x83 fork_trampoline+0xe
5 100321 zfskern txg_thread_enter mi_switch+0xe6 sleepq_wait+0x2c _cv_wait+0x16e txg_sync_thread+0x1cb fork_exit+0x83 fork_trampoline+0xe
5 100322 zfskern solthread 0xfffffff mi_switch+0xe6 sleepq_wait+0x2c _cv_wait+0x16e zthr_procedure+0x9b fork_exit+0x83 fork_trampoline+0xe
5 100323 zfskern solthread 0xfffffff mi_switch+0xe6 sleepq_wait+0x2c _cv_wait+0x16e zthr_procedure+0x9b fork_exit+0x83 fork_trampoline+0xe

pkg upgrade -fy (what got it to lock up this time):
PID TID COMM TDNAME KSTACK
15809 100406 pkg - mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 kern_wait6+0x4cd sys_wait4+0x78 amd64_syscall+0xa38 fast_syscall_common+0x101

(the second process):
PID TID COMM TDNAME KSTACK
15810 100344 pkg - mi_switch+0xe6 sleepq_wait+0x2c _cv_wait+0x16e zil_commit_impl+0xdeb zfs_freebsd_fsync+0xa7 VOP_FSYNC_APV+0x82 kern_fsync+0x1c3 amd64_syscall+0xa38 fast_syscall_common+0x101
(In reply to Kristian K. Christensen from comment #13) The arc_reclaim thread suggests you might be running very low on memory. Have you tuned your vfs.zfs.arc_max to accommodate the memory needed for the VMs? However, the fsync/zil_commit_impl() thread looks more like the issue my patch tries to address: #229614.
(In reply to Allan Jude from comment #14) In regards to ARC reclaim, this is zfskern INSIDE the VM. It's got a base install and about 1GB arc_max (4GB RAM). It's currently only running a small Python app I wrote to generate I/O (it writes 4096 files of 1-4KB and distributes them, before ultimately deleting them), and then I tend to run pkg upgrade -fy with a fairly large package set, which will eventually cause it to lock up. I'll try to push it to the newest 12.0-STABLE. I assumed that your patch was already in 12.0-RELEASE?
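The Python script itself isn't included in this report; as a stand-in, a shell sketch of the described load (roughly 4096 files of 1-4KB, spread over a few subdirectories - the directory layout is an assumption - and removed at the end) could look like this:

  #!/bin/sh
  # crude I/O generator: write ~4096 small files, then remove them
  DIR=/var/tmp/iogen
  mkdir -p "$DIR"
  i=0
  while [ "$i" -lt 4096 ]; do
      sub=$((i % 8))                      # spread files over 8 subdirectories
      mkdir -p "$DIR/$sub"
      blocks=$(( (i % 4) + 1 ))           # 1-4 blocks of 1KB each
      dd if=/dev/urandom of="$DIR/$sub/f$i" bs=1k count="$blocks" 2>/dev/null
      i=$((i + 1))
  done
  rm -rf "$DIR"

Running a loop like this alongside pkg upgrade -fy approximates the write- and fsync-heavy mix described above.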
(In reply to Kristian K. Christensen from comment #15) So, it seems to trigger a lot more quickly on the newest 12.0-STABLE - it hit on my second run of pkg - which is good news in terms of reproducing it, I suppose. The bad news is, it doesn't look much different. FreeBSD 12.0-RELEASE-p1 r342369 GENERIC amd64:

pkg (again, pkg tripped it up):
PID TID COMM TDNAME KSTACK
6732 100408 pkg - mi_switch+0xe1 sleepq_wait+0x2c _cv_wait+0x152 zil_commit_impl+0x94d zfs_freebsd_fsync+0xa7 VOP_FSYNC_APV+0x7e kern_fsync+0x1a0 amd64_syscall+0x369 fast_syscall_common+0x101

zfskern:
PID TID COMM TDNAME KSTACK
5 100038 zfskern arc_reclaim_thread mi_switch+0xe1 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x17a arc_reclaim_thread+0x146 fork_exit+0x83 fork_trampoline+0xe
5 100039 zfskern arc_dnlc_evicts_thr mi_switch+0xe1 sleepq_wait+0x2c _cv_wait+0x152 arc_dnlc_evicts_thread+0x16f fork_exit+0x83 fork_trampoline+0xe
5 100041 zfskern dbuf_evict_thread mi_switch+0xe1 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x17a dbuf_evict_thread+0x1c8 fork_exit+0x83 fork_trampoline+0xe
5 100058 zfskern l2arc_feed_thread mi_switch+0xe1 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x17a l2arc_feed_thread+0x219 fork_exit+0x83 fork_trampoline+0xe
5 100307 zfskern trim zroot mi_switch+0xe1 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x17a trim_thread+0x11f fork_exit+0x83 fork_trampoline+0xe
5 100325 zfskern txg_thread_enter mi_switch+0xe1 sleepq_wait+0x2c _cv_wait+0x152 txg_quiesce+0x21b txg_quiesce_thread+0x11b fork_exit+0x83 fork_trampoline+0xe
5 100326 zfskern txg_thread_enter mi_switch+0xe1 sleepq_wait+0x2c _cv_wait+0x152 txg_sync_thread+0x13b fork_exit+0x83 fork_trampoline+0xe
5 100327 zfskern solthread 0xfffffff mi_switch+0xe1 sleepq_wait+0x2c _cv_wait+0x152 zthr_procedure+0xcc fork_exit+0x83 fork_trampoline+0xe
5 100328 zfskern solthread 0xfffffff mi_switch+0xe1 sleepq_wait+0x2c _cv_wait+0x152 zthr_procedure+0xcc fork_exit+0x83 fork_trampoline+0xe
So, after some extensive testing over the holidays:

FreeBSD 11.1 with as low as 100MB ARC max does not exhibit this behavior.
FreeBSD 11.2 and 12.0 with 1024MB ARC max and lower exhibit this behavior.
FreeBSD 11.2 and 12.0 with over 4GB ARC max do not hang. (I need to see how low I can make it go, but so far I've been unable to cause a 4GB ARC max to hang.)

It may seem odd to want to limit the ARC, but we have experienced high ARC usage and applications (bhyve VMs and MySQL daemons, notably) being killed due to OOM if we don't enforce ARC limits. This causes a lot of nightly on-call activity and data loss, which is not really something we care for. Furthermore, this is on VMs that use a zvol as primary storage, on a host that has a fairly high ARC limit, so ideally the host handles most of the ARC.
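A sketch of the guest-side setting these observations point at as an interim mitigation (the value is 4 GiB in bytes, matching the threshold reported above; as the next comment notes, it did not avoid the hang in every environment):

  # /boot/loader.conf inside the guest
  vfs.zfs.arc_max="4294967296"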
Just had this occur again on a VM running under bhyve with 12.0-STABLE, checked out and compiled 6 days ago (r344917). The VM host is running the exact same kernel. The modifications in zfs_znode.c are present, but we still had an issue after the system had been running for a couple of days. The VM has >4GB arc_max (so the workaround as described by Kristian doesn't work for us):

VM:
vfs.zfs.arc_min: 903779840
vfs.zfs.arc_max: 7230238720

Hypervisor:
vfs.zfs.arc_min: 8216929792
vfs.zfs.arc_max: 65735438336

procstat -kk on the bhyve process:

root@cloud02:/home/roel # procstat -kk 18178
PID TID COMM TDNAME KSTACK
18178 101261 bhyve mevent mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a kqueue_kevent+0x297 kern_kevent+0xb5 kern_kevent_generic+0x70 sys_kevent+0x61 amd64_syscall+0x34d fast_syscall_common+0x101
18178 101731 bhyve vtnet-2:0 tx mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101
18178 101732 bhyve blk-3:0:0-0 mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101
18178 101733 bhyve blk-3:0:0-1 mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101
18178 101734 bhyve blk-3:0:0-2 mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101
18178 101735 bhyve blk-3:0:0-3 mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101
18178 101736 bhyve blk-3:0:0-4 mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101
18178 101737 bhyve blk-3:0:0-5 mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101
18178 101738 bhyve blk-3:0:0-6 mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101
18178 101739 bhyve blk-3:0:0-7 mi_switch+0xe2 sleepq_catch_signals+0x405 sleepq_wait_sig+0xf _sleep+0x23a umtxq_sleep+0x133 do_wait+0x427 __umtx_op_wait_uint_private+0x53 amd64_syscall+0x34d fast_syscall_common+0x101
18178 101740 bhyve vcpu 0 <running>
18178 101741 bhyve vcpu 1 mi_switch+0xe2 sleepq_timedwait+0x2f msleep_spin_sbt+0x138 vm_run+0x502 vmmdev_ioctl+0xbed devfs_ioctl+0xad VOP_IOCTL_APV+0x7c vn_ioctl+0x161 devfs_ioctl_f+0x1f kern_ioctl+0x26d sys_ioctl+0x15d amd64_syscall+0x34d fast_syscall_common+0x101
18178 101742 bhyve vcpu 2 mi_switch+0xe2 sleepq_timedwait+0x2f msleep_spin_sbt+0x138 vm_run+0x502 vmmdev_ioctl+0xbed devfs_ioctl+0xad VOP_IOCTL_APV+0x7c vn_ioctl+0x161 devfs_ioctl_f+0x1f kern_ioctl+0x26d sys_ioctl+0x15d amd64_syscall+0x34d fast_syscall_common+0x101
18178 101743 bhyve vcpu 3 mi_switch+0xe2 sleepq_timedwait+0x2f msleep_spin_sbt+0x138 vm_run+0x502 vmmdev_ioctl+0xbed devfs_ioctl+0xad VOP_IOCTL_APV+0x7c vn_ioctl+0x161 devfs_ioctl_f+0x1f kern_ioctl+0x26d sys_ioctl+0x15d amd64_syscall+0x34d fast_syscall_common+0x101
18178 101744 bhyve vcpu 4 mi_switch+0xe2 sleepq_timedwait+0x2f msleep_spin_sbt+0x138 vm_run+0x502 vmmdev_ioctl+0xbed devfs_ioctl+0xad VOP_IOCTL_APV+0x7c vn_ioctl+0x161 devfs_ioctl_f+0x1f kern_ioctl+0x26d sys_ioctl+0x15d amd64_syscall+0x34d fast_syscall_common+0x101
18178 101745 bhyve vcpu 5 mi_switch+0xe2 sleepq_timedwait+0x2f msleep_spin_sbt+0x138 vm_run+0x502 vmmdev_ioctl+0xbed devfs_ioctl+0xad VOP_IOCTL_APV+0x7c vn_ioctl+0x161 devfs_ioctl_f+0x1f kern_ioctl+0x26d sys_ioctl+0x15d amd64_syscall+0x34d fast_syscall_common+0x101
I have a very similar problem to the one described in this issue: I/O in guests hangs. I've experienced this with FreeBSD 11.2, 12.0 (both with ZFS inside) and Ubuntu 18.04 (ext4) guests. This started happening after migrating guests from the old hypervisor running 12.0 on a Xeon to the new one running CURRENT (r347183) on an AMD Epyc. On the old hypervisor these VMs were running stably for a couple of months. On the new hypervisor there's plenty of free resources. Swap is disabled.

Mem: 3761M Active, 1636M Inact, 5802M Wired, 51G Free
ARC: 4000M Total, 487M MFU, 3322M MRU, 3456K Anon, 129M Header, 59M Other
     2228M Compressed, 3202M Uncompressed, 1.44:1 Ratio

vfs.zfs.arc_min: 8215740416
vfs.zfs.arc_max: 52582796492

Procstat from the bhyve process:

root@utgard:~ # procstat -kk 95379
PID TID COMM TDNAME KSTACK
95379 101075 bhyve mevent mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 kqueue_kevent+0xa94 kern_kevent_fp+0x95 kern_kevent+0x9f kern_kevent_generic+0x70 sys_kevent+0x61 amd64_syscall+0x276 fast_syscall_common+0x101
95379 101258 bhyve blk-4:0-0 mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101259 bhyve blk-4:0-1 mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101260 bhyve blk-4:0-2 mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101261 bhyve blk-4:0-3 mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101262 bhyve blk-4:0-4 mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101263 bhyve blk-4:0-5 mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101264 bhyve blk-4:0-6 mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101265 bhyve blk-4:0-7 mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101266 bhyve vtnet-5:0 tx mi_switch+0x174 sleepq_switch+0x110 sleepq_catch_signals+0x3e7 sleepq_wait_sig+0xf _sleep+0x2d0 umtxq_sleep+0x153 do_wait+0x206 __umtx_op_wait_uint_private+0x7e amd64_syscall+0x276 fast_syscall_common+0x101
95379 101267 bhyve vcpu 0 mi_switch+0x174 sleepq_switch+0x110 sleepq_timedwait+0x4f msleep_spin_sbt+0x144 vm_run+0x970 vmmdev_ioctl+0x7ea devfs_ioctl+0xca VOP_IOCTL_APV+0x63 vn_ioctl+0x124 devfs_ioctl_f+0x1f kern_ioctl+0x28a sys_ioctl+0x15d amd64_syscall+0x276 fast_syscall_common+0x101
95379 101268 bhyve vcpu 1 mi_switch+0x174 sleepq_switch+0x110 sleepq_timedwait+0x4f msleep_spin_sbt+0x144 vm_run+0x970 vmmdev_ioctl+0x7ea devfs_ioctl+0xca VOP_IOCTL_APV+0x63 vn_ioctl+0x124 devfs_ioctl_f+0x1f kern_ioctl+0x28a sys_ioctl+0x15d amd64_syscall+0x276 fast_syscall_common+0x101

I will be happy to provide more information to help solve this issue.
Output of procstat -kk -a | grep zfs:

# procstat -kk -a | grep zfs
0 100403 kernel zfsvfs mi_switch+0x174 sleepq_switch+0x110 sleepq_wait+0x43 _sleep+0x2da taskqueue_thread_loop+0xe1 fork_exit+0x84 fork_trampoline+0xe
0 100771 kernel zfs_vn_rele_taskq mi_switch+0x174 sleepq_switch+0x110 sleepq_wait+0x43 _sleep+0x2da taskqueue_thread_loop+0xe1 fork_exit+0x84 fork_trampoline+0xe
46 100365 zfskern solthread 0xfffffff mi_switch+0x174 sleepq_switch+0x110 sleepq_timedwait+0x4f _cv_timedwait_sbt+0x185 zthr_procedure+0x12c fork_exit+0x84 fork_trampoline+0xe
46 100366 zfskern solthread 0xfffffff mi_switch+0x174 sleepq_switch+0x110 sleepq_timedwait+0x4f _cv_timedwait_sbt+0x185 zthr_procedure+0x12c fork_exit+0x84 fork_trampoline+0xe
46 100367 zfskern arc_dnlc_evicts_thr mi_switch+0x174 sleepq_switch+0x110 sleepq_wait+0x43 _cv_wait+0x15b arc_dnlc_evicts_thread+0x133 fork_exit+0x84 fork_trampoline+0xe
46 100369 zfskern dbuf_evict_thread mi_switch+0x174 sleepq_switch+0x110 sleepq_timedwait+0x4f _cv_timedwait_sbt+0x185 dbuf_evict_thread+0x296 fork_exit+0x84 fork_trampoline+0xe
46 100402 zfskern l2arc_feed_thread mi_switch+0x174 sleepq_switch+0x110 sleepq_timedwait+0x4f _cv_timedwait_sbt+0x185 l2arc_feed_thread+0x1d5 fork_exit+0x84 fork_trampoline+0xe
46 100698 zfskern trim zroot mi_switch+0x174 sleepq_switch+0x110 sleepq_timedwait+0x4f _cv_timedwait_sbt+0x185 trim_thread+0xab fork_exit+0x84 fork_trampoline+0xe
46 100812 zfskern txg_thread_enter mi_switch+0x174 sleepq_switch+0x110 sleepq_wait+0x43 _cv_wait+0x15b txg_thread_wait+0x9a txg_quiesce_thread+0x2a6 fork_exit+0x84 fork_trampoline+0xe
46 100813 zfskern txg_thread_enter mi_switch+0x174 sleepq_switch+0x110 sleepq_timedwait+0x4f _cv_timedwait_sbt+0x185 txg_thread_wait+0x8d txg_sync_thread+0x53b fork_exit+0x84 fork_trampoline+0xe
46 100814 zfskern solthread 0xfffffff mi_switch+0x174 sleepq_switch+0x110 sleepq_wait+0x43 _cv_wait+0x15b zthr_procedure+0xeb fork_exit+0x84 fork_trampoline+0xe
46 100815 zfskern solthread 0xfffffff mi_switch+0x174 sleepq_switch+0x110 sleepq_wait+0x43 _cv_wait+0x15b zthr_procedure+0xeb fork_exit+0x84 fork_trampoline+0xe
(In reply to Mateusz Kwiatkowski from comment #19)

> On the new hypervisor there's plenty of free resources. Swap is disabled.
>
> Mem: 3761M Active, 1636M Inact, 5802M Wired, 51G Free
> ARC: 4000M Total, 487M MFU, 3322M MRU, 3456K Anon, 129M Header, 59M Other
>      2228M Compressed, 3202M Uncompressed, 1.44:1 Ratio
>
> vfs.zfs.arc_min: 8215740416
> vfs.zfs.arc_max: 52582796492

How much memory are your VMs using? I.e., what is the total of all the VMs you have running at any time? How much total memory does the system have? Your arc_max is 52GB and 4GB is in use now, so the ARC could try to use another 48GB; your system has 51GB free under the above conditions, so the ARC could drive your system down to 3GB of free memory. I am concerned that your host workload may at times drive your system into memory starvation, unless these numbers were actually taken with all of the system under load.
(In reply to Rodney W. Grimes from comment #21)

> How much memory are your VMs using? I.e., what is the total of all the VMs you have
> running at any time? How much total memory does the system have?

VMs may use 8GB of RAM in total. The hypervisor has 64GB in total. So the VMs and ARC could create an OOM condition, but the statistics I provided were taken at the moment one of the VMs hung. I'll reduce arc_max to 8GB anyway, to make sure this isn't the cause.
(In reply to Mateusz Kwiatkowski from comment #22)

> VMs may use 8GB of RAM in total. The hypervisor has 64GB in total. So the VMs and ARC
> could create an OOM condition, but the statistics I provided were taken at the moment
> one of the VMs hung. I'll reduce arc_max to 8GB anyway, to make sure this isn't the cause.

Not sure you need to cut arc_max to 8GB; just make sure that arc_max + total(all VMs) + host usage is less than 64GB.

I have been pointed at the following: if the guest(s) are using virtio-block, it might be interesting to check the status of the I/O queues for those devices. The reported issues could be a case of https://smartos.org/bugview/OS-7613 (posted as https://reviews.freebsd.org/D19501).

I have pinged the D19501 review and it looks like some forward motion has happened there. You may want to try the patch that is there now, if you can, knowing that it is about to be revised.
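As a worked example of that budget, using the numbers from this thread plus an assumed ~4GB for the host itself: 64GB total - 8GB for the VMs - 4GB host overhead leaves roughly 52GB, so the configured arc_max of ~52GB sits right at the edge with no slack once the ARC fills, while something in the 16-32GB range would keep arc_max + VMs + host usage comfortably below 64GB.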
(In reply to Rodney W. Grimes from comment #23) Rodney, many thanks for your engagement in this matter. It seems that you were right in suspecting that arc_max is the culprit. After reducing its size, the virtual machines have been working stably for over a day. So, instead of a stability issue, this was more of a user-experience issue. As you can see, we were both tricked by the top(1) output, which looked fine and didn't indicate a problem. Would it be possible to have a notice printed in dmesg when this happens (as we have for other kernel limits)?
(In reply to Mateusz Kwiatkowski from comment #24) I'm afraid I celebrated too soon. I/O in a VM locked up - it just took more time. Memory stats:

Mem: 3846M Active, 4806M Inact, 26G Wired, 27G Free
ARC: 7755M Total, 2419M MFU, 4208M MRU, 14M Anon, 478M Header, 635M Other
     6145M Compressed, 10G Uncompressed, 1.68:1 Ratio

I'll now test the patch with the virtio barriers.
(In reply to Rodney W. Grimes from comment #23) The VM running the patched bhyve has been stable since Sunday evening. That is three whole days, and it never ran that long without a crash before. I think we can assume the patch did the job!
A commit references this bug:

Author: rgrimes
Date: Sat May 18 19:32:39 UTC 2019
New revision: 347960
URL: https://svnweb.freebsd.org/changeset/base/347960

Log:
  bhyve virtio needs barriers

  Under certain tight race conditions, we found that the lack of a memory
  barrier in bhyve's virtio handling causes it to miss a NO_NOTIFY state
  transition on block devices, resulting in guest stall. The investigation
  is recorded in OS-7613. As part of the examination into bhyve's use of
  barriers, one other section was found to be problematic, but only on
  non-x86 ISAs with less strict memory ordering. That was addressed in this
  patch as well, although it was not at all a problem on x86.

  PR:           231117
  Submitted by: Patrick Mooney <patrick.mooney@joyent.com>
  Reviewed by:  jhb, kib, rgrimes
  Approved by:  jhb
  MFC after:    3 days
  Differential Revision: https://reviews.freebsd.org/D19501

Changes:
  head/usr.sbin/bhyve/virtio.c
A commit references this bug:

Author: rgrimes
Date: Thu May 23 18:48:47 UTC 2019
New revision: 348186
URL: https://svnweb.freebsd.org/changeset/base/348186

Log:
  MFC: r347960:

  bhyve virtio needs barriers

  Under certain tight race conditions, we found that the lack of a memory
  barrier in bhyve's virtio handling causes it to miss a NO_NOTIFY state
  transition on block devices, resulting in guest stall. The investigation
  is recorded in OS-7613. As part of the examination into bhyve's use of
  barriers, one other section was found to be problematic, but only on
  non-x86 ISAs with less strict memory ordering. That was addressed in this
  patch as well, although it was not at all a problem on x86.

  PR:           231117
  Submitted by: Patrick Mooney <patrick.mooney@joyent.com>
  Reviewed by:  jhb, kib, rgrimes
  Approved by:  jhb
  Differential Revision: https://reviews.freebsd.org/D19501

Changes:
  _U  stable/12/
  stable/12/usr.sbin/bhyve/virtio.c
A commit references this bug:

Author: rgrimes
Date: Thu May 23 18:58:06 UTC 2019
New revision: 348188
URL: https://svnweb.freebsd.org/changeset/base/348188

Log:
  MFC: r347960:

  bhyve virtio needs barriers

  Under certain tight race conditions, we found that the lack of a memory
  barrier in bhyve's virtio handling causes it to miss a NO_NOTIFY state
  transition on block devices, resulting in guest stall. The investigation
  is recorded in OS-7613. As part of the examination into bhyve's use of
  barriers, one other section was found to be problematic, but only on
  non-x86 ISAs with less strict memory ordering. That was addressed in this
  patch as well, although it was not at all a problem on x86.

  PR:           231117
  Submitted by: Patrick Mooney <patrick.mooney@joyent.com>
  Reviewed by:  jhb, kib, rgrimes
  Approved by:  re (gjb), jhb
  Differential Revision: https://reviews.freebsd.org/D19501

Changes:
  _U  stable/11/
  stable/11/usr.sbin/bhyve/virtio.c
The bhyve chatroom reports that the memory barrier fix seems to solve this issue. Should this PR be closed now?
(In reply to Allan Jude from comment #30) Hi Allan, I can confirm that we are no longer experiencing this issue, neither on 11.3-RELEASE nor on 12.1-RELEASE (though 11.2-RELEASE and 12.0-RELEASE are still affected; I assume there's no plan to backport the fix there, for obvious reasons ;)). Thank you so much for your effort - you and everyone else who worked on this.
Fixed in 11.3 and 12.1, so there is no need to keep the bug open from what I can see. Please reopen if you disagree.