Summary:   ZFS lockup in zil_commit_impl
Product:   Base System
Component: kern
Status:    Closed FIXED
Severity:  Affects Many People
Priority:  Normal
Keywords:  regression
Version:   11.2-RELEASE
Hardware:  amd64
OS:        Any
Reporter:  Andreas Sommer <andreas.sommer87>
Assignee:  Allan Jude <allanjude>
CC:        allanjude, avg, cperciva, eborisch+FreeBSD, flo, fs, grembo, jo, mandree, markj, michal.kulik3, nivedita, olevole, olgeni, pi, pjd, richard, sigsys, simon, vi
Flags:     koobs: mfc-stable11+
URL:       https://reviews.freebsd.org/D17460
See Also:  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=231117
Description
Andreas Sommer
2018-07-08 20:42:53 UTC
Adding Colin in case this might be AWS related.

Which EC2 instance type is this?

(In reply to Colin Percival from comment #2)
t2.xlarge

Copyright (c) 1992-2018 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.2-RELEASE #0 r335510: Fri Jun 22 04:32:14 UTC 2018
    root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 6.0.0 (tags/RELEASE_600/final 326565) (based on LLVM 6.0.0)
VT(vga): text 80x25
XEN: Hypervisor version 4.2 detected.
CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz (2394.50-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x306f2  Family=0x6  Model=0x3f  Stepping=2
  Features=0x1783fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0xfffa3203<SSE3,PCLMULQDQ,SSSE3,FMA,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND,HV>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x21<LAHF,ABM>
  Structured Extended Features=0x7a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID>
  XSAVE Features=0x1<XSAVEOPT>
Hypervisor: Origin = "XenVMMXenVMM"
real memory  = 17179869184 (16384 MB)
avail memory = 16600379392 (15831 MB)
Event timer "LAPIC" quality 100
ACPI APIC Table: <Xen HVM>
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s)
ioapic0: Changing APIC ID to 1
MADT: Forcing active-low polarity and level trigger for SCI
ioapic0 <Version 1.1> irqs 0-47 on motherboard
SMP: AP CPU #3 Launched!
SMP: AP CPU #2 Launched!
SMP: AP CPU #1 Launched!

It's also noteworthy that the zpool was a bit behind before the hang happened (just upgraded, no reboot yet):

# zpool upgrade tank
This system supports ZFS pool feature flags.
Enabled the following features on 'tank':
  sha512
  skein
  device_removal
  obsolete_counts
  zpool_checkpoint

And that it's using ashift=9:

# zdb
tank:
    version: 5000
    name: 'tank'
    state: 0
    txg: 4612743
    pool_guid: 819174864231115292
    hostid: 680786551
    hostname: ''
    com.delphix:has_per_vdev_zaps
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 819174864231115292
        children[0]:
            type: 'disk'
            id: 0
            guid: 1449467969842353068
            path: '/dev/ada0p3'
            phys_path: '/dev/ada0p3'
            whole_disk: 1
            metaslab_array: 34
            metaslab_shift: 26
            ashift: 9
            asize: 159982288896
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_leaf: 4143
            com.delphix:vdev_zap_top: 4144
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data

Maybe this is also related to bug #219224.

"procstat -kk -a" crashed the instance (rebooted).

(In reply to Michael Gmelin from comment #5)
I am confident that it's the same bug. I think that it is a deadlock between a thread like the one from comment #0 and a thread like this:

zfs_zget+0x1b4
zfs_get_data+0x7c
zil_commit+0x926
zfs_sync+0xad
sys_sync+0xe6
amd64_syscall+0x2f6
Xfast_syscall+0xfb

The first thread is waiting on zil_commit in the second thread to complete. At the same time, the second thread is waiting (in zfs_zget) for the first thread to finish reclaiming the vnode. I think I understand the problem, but I do not see an easy way to fix it. Checking the vnode state in zil_commit -> zfs_get_data -> zfs_zget is problematic, as zil_commit can be called from vnode operations while an arbitrary other vnode is locked.

(In reply to Andriy Gapon from comment #7)
Given this fact, do you think that there are:
1. Ways to prevent such an interlock in production (e.g. by avoiding certain operations)
2. Ways an unprivileged user could use this to break a system
3. Ways to still use 11.2 with ZFS in production
4. Reasons to add this to the release errata section

Thanks,
Michael

Saw this again just now, after having reinstalled the whole instance on July 10th (12 days "incubation time" :P).
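The circular wait described in comment #7 can be sketched generically. This is a minimal two-lock illustration, not ZFS code: one lock stands in for the vnode being reclaimed, the other for the ZIL commit state; all names are hypothetical. Timeouts are used so the sketch terminates instead of hanging like the real system.

```python
import threading

# Stand-ins for the two resources in the report (illustrative names only).
vnode_lock = threading.Lock()   # the vnode under reclamation
zil_lock = threading.Lock()     # the ZIL commit state
both_holding = threading.Barrier(2)    # each thread holds its first lock
both_attempted = threading.Barrier(2)  # keep first locks held during attempts
outcome = {}

def reclaim_thread():
    # First stack in the report: vinactive -> vgonel holds the vnode,
    # then putpages -> zil_commit_impl needs the ZIL state.
    with vnode_lock:
        both_holding.wait()
        outcome["reclaim"] = zil_lock.acquire(timeout=0.2)
        both_attempted.wait()

def sync_thread():
    # Second stack: sys_sync -> zil_commit holds the ZIL state, then
    # zfs_get_data -> zfs_zget waits on the vnode being reclaimed.
    with zil_lock:
        both_holding.wait()
        outcome["sync"] = vnode_lock.acquire(timeout=0.2)
        both_attempted.wait()

t1 = threading.Thread(target=reclaim_thread)
t2 = threading.Thread(target=sync_thread)
t1.start(); t2.start()
t1.join(); t2.join()
# Each thread timed out waiting on the lock the other still held:
# neither acquire can ever succeed, which is the hang being reported.
print(sorted(outcome.items()))  # [('reclaim', False), ('sync', False)]
```

With real (untimed) lock acquisition, both threads would block forever, matching the `_sx_xlock_hard` / `zfs_zget` stacks seen in `procstat -kk`.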
Again, I could see the steady CPU usage (spinlock) in AWS CloudWatch monitoring graphs. Doing `procstat -kk` on the buildbot master process gave `procstat: sysctl: kern.proc.kstack: 67301: Device busy`, while `procstat -kka` made the system unavailable and rebooted it. Thus, no further info collected, apart from the knowledge that this also happens for ashift=12 (originally I had an ashift=9 pool on the old instance, which had this problem every few days). Anything more we can do to debug/resolve this?

I have the same problem, does anyone know the solution?

(In reply to Mike from comment #10)
As a workaround, try whether setting kern.maxvnodes to a high number (it's usually quite low on small VMs) helps, e.g.

sysctl kern.maxvnodes=2000000

Use

sysrc -f /etc/sysctl.conf kern.maxvnodes=2000000

to set it permanently.

(In reply to Michael Gmelin from comment #11)
But I don't use VPS/Cloud/AWS. I have a dedicated server and kern.maxvnodes=2215782.

(In reply to Mike from comment #12)
Interesting, I haven't seen the problem on bare metal yet (we have 4 million there, which is plenty for our workloads). Check sysctl vfs.numvnodes under your typical workloads and set kern.maxvnodes so that it's twice as much (even if it works, this is just a workaround; the underlying issue absolutely has to be fixed). Can you reproduce the problem reliably, or does it just happen once in a while (e.g. during nightly periodic jobs)?

(In reply to Michael Gmelin from comment #13)
I have started monitoring vfs.numvnodes and it indeed sometimes almost fills up the "2215782" setting of kern.maxvnodes. Do you suggest raising the latter to 4431564 (twice as much), or would this be too high and cause potential issues?

(In reply to Mike from comment #14)
As long as you're not memory constrained, raising kern.maxvnodes shouldn't cause any issues. If anything, it could even increase disk I/O performance (we upped small VMs that would by default have 200k to 1000000 kern.maxvnodes without any issues).
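The rule of thumb from comment #13 (size kern.maxvnodes at roughly twice the observed peak of vfs.numvnodes) can be sketched as a small helper. The function name and the floor parameter are hypothetical, chosen for this sketch; the 200k floor reflects the small-VM default mentioned later in the thread.

```python
def recommended_maxvnodes(numvnodes_samples, floor=200_000):
    """Workaround heuristic from the thread: set kern.maxvnodes to
    roughly twice the observed peak of vfs.numvnodes.

    numvnodes_samples: values of vfs.numvnodes collected under typical
    workloads. The floor guards against sizing from an idle system.
    (Hypothetical helper for illustration, not a FreeBSD API.)
    """
    peak = max(numvnodes_samples)
    return max(2 * peak, floor)

# Using the numbers from the thread: a peak of 2215782 against
# kern.maxvnodes=2215782 suggests doubling to 4431564.
print(recommended_maxvnodes([1800000, 2215782, 2100000]))  # 4431564
```

On a live system the samples would come from periodically reading `sysctl -n vfs.numvnodes`; the result would then be applied via `sysctl kern.maxvnodes=...` as shown above.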
It would be interesting if you could share whether the adjustments improved your situation, as well as the kind of workloads that currently cause interlocks in your setup, to help create a test case that reproduces the problem reliably.

Hi all, within the last weeks I've managed to strip my host's workload down to a reproducible example of the interlock, and here it is: https://github.com/AndiDog/zfs-lockup

Please have a look at the `README` and `scripts/all.sh` files at least to understand the very simple setup. Surely it can be reproduced with even simpler tricks. I talked to Allan Jude (CC) at EuroBSDcon and he thought that any synchronous write might trigger the issue (and had some ideas on debugging it) – for me, the buildbot master did a good job of triggering the interlock by writing lots of stuff to its SQLite database. It would be great if some ZFS experts could take a look into this, now that we have a way to reproduce it. Since several people have already claimed to have this problem, it remains a pretty risky issue for production. (P.S. if you find non-best-practice things in the setup, I'm interested to hear about it by short e-mail or @AndiDog87, just for the sake of learning.)

(In reply to Michael Gmelin from comment #15)
We are getting lockups on a 640T fileserver as well. About once a week. I currently have:

# sysctl kern.maxvnodes
kern.maxvnodes: 22999999
# sysctl vfs.numvnodes
vfs.numvnodes: 22999998

(We've been increasing it a lot and monitoring RAM usage.)

Been repeatedly hitting this over the past month. I noticed correlated issues with the CPU temperature monitoring, but I think those were a red herring. One thing I can say for sure: turning off hyperthreading made this take a LOT longer to happen. My best guess is that it's a race condition with locking, so the more cores you have, the more likely you are to hit it. Notably, this IS happening on bare metal for me, though it's in a jail.
This is the line containing the _sx_xlock_hard call that's hanging: https://github.com/freebsd/freebsd/blob/release/11.2.0/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c#L2190 (beginning of zil_commit_writer). That means that the issuer lock is being held by something. Notably, the issuer lock can be held while waiting on a txg to sync in zil_commit_waiter_timeout (or more precisely, its call to zil_commit_writer_stall). I'm not entirely sure (since I'm unfamiliar with this code), but I think this may be the source of the deadlock.

Upon inspection, both the quiesce and sync threads appear to be idle, and I can't find any evidence that there's any other thread in that function, so I'm not sure that's what's going on. Plus I have no idea what's going on with the un-ptraceable process mentioned in comment 9, which I've observed too. It appears to be livelocked on my system (100% of one thread at all times), but I can't inspect it at all; even kgdb can't touch it.

In my recreation:

55649 100482 python3.6 - mi_switch+0xe6 sleepq_wait+0x2c _sx_xlock_hard+0x306 zil_commit_impl+0x11d zfs_freebsd_putpages+0x635 VOP_PUTPAGES_APV+0x82 vnode_pager_putpages+0x8e vm_pageout_flush+0xea vm_object_page_collect_flush+0x213 vm_object_page_clean+0x146 vm_object_terminate+0x93 zfs_freebsd_reclaim+0x1e VOP_RECLAIM_APV+0x82 vgonel+0x208 vrecycle+0x4a zfs_freebsd_inactive+0xd VOP_INACTIVE_APV+0x82 vinactive+0xfc

The sync process does seem to be doing work, but not making progress. Sampled 10000 times:

 NUM   PID    TID COMM TDNAME KSTACK
   1 56618 100457 sync - mi_switch critical_exit intr_event_handle intr_execute_handlers lapic_handle_intr Xapic_isr1 dbuf_read dmu_bonus_hold zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
   1 56618 100457 sync - mi_switch critical_exit ipi_bitmap_handler Xipi_intr_bitmap_handler VOP_ISLOCKED_APV zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
   1 56618 100457 sync - mi_switch critical_exit ipi_bitmap_handler Xipi_intr_bitmap_handler zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
  23 56618 100457 sync - VOP_ISLOCKED_APV zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
 112 56618 100457 sync - arc_access arc_buf_access dbuf_hold_impl dbuf_hold dnode_hold_impl dmu_bonus_hold zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
 118 56618 100457 sync - dnode_rele dmu_bonus_hold zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
 148 56618 100457 sync - vref zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
 150 56618 100457 sync - dbuf_find dbuf_hold_impl dbuf_hold dnode_hold_impl dmu_bonus_hold zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
 248 56618 100457 sync - dbuf_hold dnode_hold_impl dmu_bonus_hold zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
 263 56618 100457 sync - vputx zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
 273 56618 100457 sync - dmu_object_info_from_dnode dmu_object_info_from_db zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
 292 56618 100457 sync - zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
 428 56618 100457 sync - <running>
 458 56618 100457 sync - arc_buf_access dbuf_hold_impl dbuf_hold dnode_hold_impl dmu_bonus_hold zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
 572 56618 100457 sync - dmu_object_info_from_db zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
 671 56618 100457 sync - dbuf_read dmu_bonus_hold zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
 690 56618 100457 sync - dbuf_read dnode_hold_impl dmu_bonus_hold zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
 937 56618 100457 sync - dbuf_hold_impl dbuf_hold dnode_hold_impl dmu_bonus_hold zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
1272 56618 100457 sync - dmu_bonus_hold zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
1433 56618 100457 sync - dnode_hold_impl dmu_bonus_hold zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common
1909 56618 100457 sync - zfs_zget zfs_get_data zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common

This patch has been tested in my reproduction setup and prevents the hang. I am not yet confident in the lack of locking before reading zp->z_unlinked, though. Waiting on review from those more familiar with VFS locking than myself. However, if you feel comfortable testing this patch, please do: https://reviews.freebsd.org/D17460

I've just recently upgraded my HPE MicroServer Gen8 to 11.2 (bare metal) and I appear to be consistently hitting this bug every night during backups. Testing the patch now to see if it helps my issue.

A commit references this bug:

Author: allanjude
Date: Wed Oct 10 19:39:47 UTC 2018
New revision: 339289
URL: https://svnweb.freebsd.org/changeset/base/339289

Log:
  Resolve a hang in ZFS during vnode reclamation

  This is caused by a deadlock between zil_commit() and zfs_zget()

  Add a way for zfs_zget() to break out of the retry loop in the common case

  PR:            229614
  Reported by:   grembo, Andreas Sommer, many others
  Tested by:     Andreas Sommer, Vicki Pfau
  Reviewed by:   avg (no objection)
  Approved by:   re (gjb)
  MFC after:     2 months
  Sponsored by:  Klara Systems
  Differential Revision: https://reviews.freebsd.org/D17460

Changes:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c

(In reply to Simon Coggins from comment #22)
Please ignore my comment.
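The shape of the committed fix ("break out of the retry loop") can be sketched abstractly. This is a control-flow illustration under assumptions drawn from the review discussion (the bailout keys off zp->z_unlinked); every name here is a hypothetical stand-in, not the real kernel API. The actual change is in D17460 / r339289.

```python
# Sketch of the retry-loop change. zfs_zget() would otherwise spin
# waiting for a vnode whose reclamation is itself blocked in
# zil_commit() on our caller, producing the deadlock.
EAGAIN = 11  # conventional errno value used as a "try again" result

class Znode:
    """Toy model of the relevant znode state (illustrative only)."""
    def __init__(self, unlinked, doomed):
        self.z_unlinked = unlinked   # file has been unlinked
        self.vnode_doomed = doomed   # vnode is undergoing reclamation

def zget_sketch(zp, max_spins=1000):
    spins = 0
    while True:
        if zp.vnode_doomed:
            # Pre-fix behavior: retry until reclamation finishes. But
            # reclamation may be waiting on our caller's ZIL commit,
            # so the loop never exits. The fix: an unlinked znode can
            # never be legitimately needed again, so fail fast.
            if zp.z_unlinked:
                return EAGAIN, None
            spins += 1
            if spins >= max_spins:
                return EAGAIN, None  # stand-in for "still stuck"
            continue
        return 0, zp  # common case: vnode usable, hand it back

print(zget_sketch(Znode(unlinked=True, doomed=True))[0])    # 11 (bails out)
print(zget_sketch(Znode(unlinked=False, doomed=False))[0])  # 0
```

The key property is that the unlinked-and-doomed case returns immediately instead of spinning, which breaks the circular wait described in comment #7.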
My issue was a ZFS dedup/ARC/bhyve memory exhaustion issue. Disabling dedup and destroying the ARC solved my issue, so it was unrelated to this issue/fix.

(In reply to commit-hook from comment #23)
> MFC after: 2 months

Is this safe? I keep having issues on most of my 11.2 boxes :| Should I patch the kernel manually in the meantime?

I recommend that everyone having the described problem and a similar `procstat -kk` stack trace patch their kernel manually for now. For our development CI host, I did just that (built and replaced the kernel only) and it has worked stably for some days now *without* the `kern.maxvnodes=HIGHNUMBER` workaround. The same applies to my reproduction setup (see GitHub link posted above) – previously it took a few seconds to reproduce the issue, and after replacing the kernel, no more problems. However, if you have 24/7 services that need to keep running, probably use the workaround for now and wait for a dot release, which will anyhow require a reboot. Two months actually sounds long enough to give the safety of many people trying and confirming the change. It will also immediately make it from head to stable/12 according to https://www.freebsd.org/releases/12.0R/schedule.html, so I'm happy.

The issue appeared again, using the patched kernel:

 PID    TID COMM      TDNAME KSTACK
[...]
1729 100862 python3.6 - mi_switch+0xe6 sleepq_wait+0x2c _sx_xlock_hard+0x306 zil_commit_impl+0x11d zfs_freebsd_putpages+0x635 VOP_PUTPAGES_APV+0x82 vnode_pager_putpages+0x8e vm_pageout_flush+0xea vm_object_page_collect_flush+0x213 vm_object_page_clean+0x146 vm_object_terminate+0x93 zfs_freebsd_reclaim+0x1e VOP_RECLAIM_APV+0x82 vgonel+0x208 vrecycle+0x4a zfs_freebsd_inactive+0xd VOP_INACTIVE_APV+0x82 vinactive+0xfc

After applying the kernel patch on Oct 8th, the problem occurred on the 22nd. A better interval than before, but here we are again. Could you give me some pointers on how to debug this myself?

(In reply to Andreas Sommer from comment #27)
Did you see comment #7?
(In reply to Andriy Gapon from comment #28)
Saw that and tried `procstat -kk` on other processes, but as found previously, it would render the complete system unusable if executed on certain processes. Didn't find anything obvious within the same process (the one that does `zil_commit_impl` in one thread). The committing process is Buildbot, and I saw that it happened during a nightly run of periodic scripts (which go through the whole filesystem). Now I have to reboot the instance... That's why a non-destructive debugging method might be more helpful to see what other interlock we have between processes/threads.

(In reply to Andreas Sommer from comment #29)
Next time, can you panic the hung system to get a crash dump?

sysctl debug.kdb.panic=1

(In reply to Andreas Sommer from comment #29)
What is your maxvnodes set to now? So you were not able to find the process that was racing against buildbot (python) and got stuck in zfs_zget()?

(In reply to Allan Jude from comment #31)
I have `kern.maxvnodes: 350367` now. I tried to find the racing process but hung the system using procstat – will try your suggestion next time around. Seems I can test this every few days – that's how often it still happens.

Did the kernel dump this time. Does any of this help?
(kgdb) info threads * 414 Thread 100499 (PID=95019: sysctl) 0xffffffff80af691b in doadump () 413 Thread 101525 (PID=94981: csh) 0xffffffff80b25edd in sched_switch () 412 Thread 100778 (PID=94980: su) 0xffffffff80b25edd in sched_switch () 411 Thread 100917 (PID=94979: sudo) 0xffffffff80b25edd in sched_switch () 410 Thread 100677 (PID=94926: bash) 0xffffffff80b25edd in sched_switch () 409 Thread 100630 (PID=94925: sshd) 0xffffffff80b25edd in sched_switch () 408 Thread 100613 (PID=94923: sshd) 0xffffffff80b25edd in sched_switch () 407 Thread 101554 (PID=94497: vim) 0xffffffff80b25edd in sched_switch () 406 Thread 101348 (PID=94494: csh) 0xffffffff80b25edd in sched_switch () 405 Thread 101335 (PID=94493: su) 0xffffffff80b25edd in sched_switch () 404 Thread 100497 (PID=94492: sudo) 0xffffffff80b25edd in sched_switch () 403 Thread 101262 (PID=89931: pkg) 0xffffffff80b25edd in sched_switch () 402 Thread 101526 (PID=89930: wc) 0xffffffff80b25edd in sched_switch () 401 Thread 101524 (PID=89929: tee) 0xffffffff80b25edd in sched_switch () 400 Thread 101426 (PID=89928: sed) 0xffffffff80b25edd in sched_switch () 399 Thread 101573 (PID=89927: pkg) 0xffffffff80b25edd in sched_switch () 398 Thread 100628 (PID=89926: sh) 0xffffffff80b25edd in sched_switch () 397 Thread 101123 (PID=89923: sh) 0xffffffff80b25edd in sched_switch () 396 Thread 101403 (PID=89773: mail) 0xffffffff80b25edd in sched_switch () 395 Thread 101513 (PID=89772: sh) 0xffffffff80b25edd in sched_switch () 394 Thread 101520 (PID=89759: sh) 0xffffffff80b25edd in sched_switch () 393 Thread 100680 (PID=89757: lockf) 0xffffffff80b25edd in sched_switch () 392 Thread 101014 (PID=89754: sh) 0xffffffff80b25edd in sched_switch () 391 Thread 100400 (PID=89753: sh) 0xffffffff80b25edd in sched_switch () 390 Thread 100620 (PID=89461: mail) 0xffffffff80b25edd in sched_switch () 389 Thread 100521 (PID=89460: sh) 0xffffffff80b25edd in sched_switch () 388 Thread 101469 (PID=89442: sh) 0xffffffff80b25edd in sched_switch () ---Type 
<return> to continue, or q <return> to quit--- 387 Thread 100643 (PID=89441: lockf) 0xffffffff80b25edd in sched_switch () 386 Thread 100426 (PID=89435: sh) 0xffffffff80b25edd in sched_switch () 385 Thread 100700 (PID=89432: cron) 0xffffffff80b25edd in sched_switch () 384 Thread 100525 (PID=87040: python3.6) 0xffffffff810f10b8 in cpustop_handler () 383 Thread 100512 (PID=84285: gpg-agent) 0xffffffff80b25edd in sched_switch () 382 Thread 100358 (PID=1257: getty) 0xffffffff80b25edd in sched_switch () 381 Thread 100359 (PID=1256: getty) 0xffffffff80b25edd in sched_switch () 380 Thread 100381 (PID=1255: getty) 0xffffffff80b25edd in sched_switch () 379 Thread 100389 (PID=1254: getty) 0xffffffff80b25edd in sched_switch () 378 Thread 100394 (PID=1253: getty) 0xffffffff80b25edd in sched_switch () 377 Thread 100390 (PID=1252: getty) 0xffffffff80b25edd in sched_switch () 376 Thread 100409 (PID=1251: getty) 0xffffffff80b25edd in sched_switch () 375 Thread 100396 (PID=1250: getty) 0xffffffff80b25edd in sched_switch () 374 Thread 100362 (PID=1249: getty) 0xffffffff80b25edd in sched_switch () 373 Thread 100410 (PID=1205: cron) 0xffffffff80b25edd in sched_switch () 372 Thread 100403 (PID=1191: sendmail) 0xffffffff80b25edd in sched_switch () 371 Thread 100388 (PID=1188: sendmail) 0xffffffff80b25edd in sched_switch () 370 Thread 100402 (PID=1132: cron) 0xffffffff80b25edd in sched_switch () 369 Thread 100374 (PID=1128: sendmail) 0xffffffff80b25edd in sched_switch () 368 Thread 100397 (PID=1125: sendmail) 0xffffffff80b25edd in sched_switch () 367 Thread 100408 (PID=1122: multilog) 0xffffffff80b25edd in sched_switch () 366 Thread 100387 (PID=1121: python3.6) 0xffffffff80b25edd in sched_switch () 365 Thread 100417 (PID=1121: python3.6) 0xffffffff80b25edd in sched_switch () 364 Thread 100418 (PID=1121: python3.6) 0xffffffff80b25edd in sched_switch () 363 Thread 101413 (PID=1121: python3.6) 0xffffffff80b25edd in sched_switch () 362 Thread 100407 (PID=1117: supervise) 0xffffffff80b25edd in 
sched_switch () 361 Thread 100406 (PID=1116: supervise) 0xffffffff80b25edd in sched_switch () ---Type <return> to continue, or q <return> to quit--- 360 Thread 100405 (PID=1113: readproctitle) 0xffffffff80b25edd in sched_switch () 359 Thread 100384 (PID=1112: svscan) 0xffffffff80b25edd in sched_switch () 358 Thread 100398 (PID=1059: syslogd) 0xffffffff80b25edd in sched_switch () 357 Thread 100364 (PID=965: cron) 0xffffffff80b25edd in sched_switch () 356 Thread 100361 (PID=961: sendmail) 0xffffffff80b25edd in sched_switch () 355 Thread 100395 (PID=958: sendmail) 0xffffffff80b25edd in sched_switch () 354 Thread 100365 (PID=957: python3.6) 0xffffffff80b25edd in sched_switch () 353 Thread 100370 (PID=950: multilog) 0xffffffff80b25edd in sched_switch () 352 Thread 100382 (PID=949: sudo) 0xffffffff80b25edd in sched_switch () 351 Thread 100385 (PID=944: supervise) 0xffffffff80b25edd in sched_switch () 350 Thread 100391 (PID=943: supervise) 0xffffffff80b25edd in sched_switch () 349 Thread 100360 (PID=940: readproctitle) 0xffffffff80b25edd in sched_switch () 348 Thread 100393 (PID=939: svscan) 0xffffffff80b25edd in sched_switch () 347 Thread 100392 (PID=891: syslogd) 0xffffffff80b25edd in sched_switch () 346 Thread 100386 (PID=787: sshd) 0xffffffff80b25edd in sched_switch () 345 Thread 100366 (PID=768: nginx) 0xffffffff80b25edd in sched_switch () 344 Thread 100356 (PID=767: nginx) 0xffffffff80b25edd in sched_switch () 343 Thread 100379 (PID=587: syslogd) 0xffffffff80b25edd in sched_switch () 342 Thread 100377 (PID=512: pf purge) 0xffffffff80b25edd in sched_switch () 341 Thread 100375 (PID=504: devd) 0xffffffff80b25edd in sched_switch () 340 Thread 100372 (PID=498: dhclient) 0xffffffff80b25edd in sched_switch () 339 Thread 100369 (PID=452: dhclient) 0xffffffff80b25edd in sched_switch () 338 Thread 100092 (PID=25: syncer) 0xffffffff80b25edd in sched_switch () 337 Thread 100091 (PID=24: vnlru) 0xffffffff80b25edd in sched_switch () 336 Thread 100090 (PID=23: bufdaemon) 
0xffffffff80b25edd in sched_switch () 335 Thread 100089 (PID=22: bufspacedaemon) 0xffffffff80b25edd in sched_switch () 334 Thread 100087 (PID=21: pagezero) 0xffffffff80b25edd in sched_switch () ---Type <return> to continue, or q <return> to quit--- 333 Thread 100085 (PID=20: vmdaemon) 0xffffffff80b25edd in sched_switch () 332 Thread 100084 (PID=19: pagedaemon/dom0) 0xffffffff80b25edd in sched_switch () 331 Thread 100086 (PID=19: pagedaemon/laundry: dom0) 0xffffffff80b25edd in sched_switch () 330 Thread 100088 (PID=19: pagedaemon/uma) 0xffffffff80b25edd in sched_switch () 329 Thread 100080 (PID=18: balloon) 0xffffffff80b25edd in sched_switch () 328 Thread 100074 (PID=17: rand_harvestq) 0xffffffff80b25edd in sched_switch () 327 Thread 100073 (PID=16: sctp_iterator) 0xffffffff80b25edd in sched_switch () 326 Thread 100052 (PID=9: zfskern/arc_reclaim_thread) 0xffffffff80b25edd in sched_switch () 325 Thread 100053 (PID=9: zfskern/arc_dnlc_evicts_thr) 0xffffffff80b25edd in sched_switch () 324 Thread 100055 (PID=9: zfskern/dbuf_evict_thread) 0xffffffff80b25edd in sched_switch () 323 Thread 100072 (PID=9: zfskern/l2arc_feed_thread) 0xffffffff80b25edd in sched_switch () 322 Thread 100323 (PID=9: zfskern/trim zroot) 0xffffffff80b25edd in sched_switch () 321 Thread 100348 (PID=9: zfskern/txg_thread_enter) 0xffffffff80b25edd in sched_switch () 320 Thread 100349 (PID=9: zfskern/txg_thread_enter) 0xffffffff80b25edd in sched_switch () 319 Thread 100350 (PID=9: zfskern/solthread 0xfffffff) 0xffffffff80b25edd in sched_switch () 318 Thread 100351 (PID=9: zfskern/solthread 0xfffffff) 0xffffffff80b25edd in sched_switch () 317 Thread 100051 (PID=8: soaiod4) 0xffffffff80b25edd in sched_switch () 316 Thread 100050 (PID=7: soaiod3) 0xffffffff80b25edd in sched_switch () 315 Thread 100049 (PID=6: soaiod2) 0xffffffff80b25edd in sched_switch () 314 Thread 100048 (PID=5: soaiod1) 0xffffffff80b25edd in sched_switch () 313 Thread 100042 (PID=15: xenstore_rcv) 0xffffffff80b25edd in sched_switch () 
312 Thread 100041 (PID=14: xenwatch) 0xffffffff80b25edd in sched_switch () 311 Thread 100034 (PID=4: cam/doneq0) 0xffffffff80b25edd in sched_switch () 310 Thread 100079 (PID=4: cam/scanner) 0xffffffff80b25edd in sched_switch () 309 Thread 100033 (PID=3: crypto returns) 0xffffffff80b25edd in sched_switch () 308 Thread 100032 (PID=2: crypto) 0xffffffff80b25edd in sched_switch () 307 Thread 100028 (PID=13: geom/g_event) 0xffffffff80b25edd in sched_switch () ---Type <return> to continue, or q <return> to quit--- 306 Thread 100029 (PID=13: geom/g_up) 0xffffffff80b25edd in sched_switch () 305 Thread 100030 (PID=13: geom/g_down) 0xffffffff80b25edd in sched_switch () 304 Thread 100008 (PID=12: intr/swi5: fast taskq) 0xffffffff80f58ed0 in fork_trampoline () 303 Thread 100010 (PID=12: intr/swi6: task queue) 0xffffffff80b25edd in sched_switch () 302 Thread 100012 (PID=12: intr/swi6: Giant taskq) 0xffffffff80b25edd in sched_switch () 301 Thread 100022 (PID=12: intr/swi1: netisr 0) 0xffffffff80b25edd in sched_switch () 300 Thread 100023 (PID=12: intr/swi4: clock (0)) 0xffffffff80b25edd in sched_switch () 299 Thread 100024 (PID=12: intr/swi4: clock (1)) 0xffffffff80f58ed0 in fork_trampoline () 298 Thread 100025 (PID=12: intr/swi4: clock (2)) 0xffffffff80f58ed0 in fork_trampoline () 297 Thread 100026 (PID=12: intr/swi4: clock (3)) 0xffffffff80f58ed0 in fork_trampoline () 296 Thread 100027 (PID=12: intr/swi3: vm) 0xffffffff80f58ed0 in fork_trampoline () 295 Thread 100035 (PID=12: intr/irq14: ata0) 0xffffffff80f58ed0 in fork_trampoline () 294 Thread 100036 (PID=12: intr/irq15: ata1) 0xffffffff80f58ed0 in fork_trampoline () 293 Thread 100037 (PID=12: intr/irq1: atkbd0) 0xffffffff80b25edd in sched_switch () 292 Thread 100038 (PID=12: intr/irq12: psm0) 0xffffffff80f58ed0 in fork_trampoline () 291 Thread 100039 (PID=12: intr/swi0: uart) 0xffffffff80b25edd in sched_switch () 290 Thread 100040 (PID=12: intr/irq808: xenstore0) 0xffffffff80b25edd in sched_switch () 289 Thread 100081 
(PID=12: intr/irq813: xbd0) 0xffffffff80b25edd in sched_switch () 288 Thread 100083 (PID=12: intr/irq814: xn0) 0xffffffff80b25edd in sched_switch () 287 Thread 100378 (PID=12: intr/swi1: pf send) 0xffffffff80f58ed0 in fork_trampoline () 286 Thread 101269 (PID=12: intr/irq815: xbd5) 0xffffffff80b25edd in sched_switch () 285 Thread 100003 (PID=11: idle/idle: cpu0) 0xffffffff80b25edd in sched_switch () 284 Thread 100004 (PID=11: idle/idle: cpu1) 0xffffffff810f10b8 in cpustop_handler () 283 Thread 100005 (PID=11: idle/idle: cpu2) 0xffffffff80b25edd in sched_switch () 282 Thread 100006 (PID=11: idle/idle: cpu3) 0xffffffff810f10b8 in cpustop_handler () 281 Thread 100002 (PID=1: init) 0xffffffff80b25edd in sched_switch () 280 Thread 100001 (PID=10: audit) 0xffffffff80b25edd in sched_switch () ---Type <return> to continue, or q <return> to quit--- 279 Thread 100000 (PID=0: kernel/swapper) 0xffffffff80b25edd in sched_switch () 278 Thread 100007 (PID=0: kernel/thread taskq) 0xffffffff80b25edd in sched_switch () 277 Thread 100009 (PID=0: kernel/kqueue_ctx taskq) 0xffffffff80b25edd in sched_switch () 276 Thread 100011 (PID=0: kernel/aiod_kick taskq) 0xffffffff80b25edd in sched_switch () 275 Thread 100013 (PID=0: kernel/if_io_tqg_0) 0xffffffff80b25edd in sched_switch () 274 Thread 100014 (PID=0: kernel/if_io_tqg_1) 0xffffffff80b25edd in sched_switch () 273 Thread 100015 (PID=0: kernel/if_io_tqg_2) 0xffffffff80b25edd in sched_switch () 272 Thread 100016 (PID=0: kernel/if_io_tqg_3) 0xffffffff80b25edd in sched_switch () 271 Thread 100017 (PID=0: kernel/softirq_0) 0xffffffff80b25edd in sched_switch () 270 Thread 100018 (PID=0: kernel/softirq_1) 0xffffffff80b25edd in sched_switch () 269 Thread 100019 (PID=0: kernel/softirq_2) 0xffffffff80b25edd in sched_switch () 268 Thread 100020 (PID=0: kernel/softirq_3) 0xffffffff80b25edd in sched_switch () 267 Thread 100021 (PID=0: kernel/if_config_tqg_0) 0xffffffff80b25edd in sched_switch () 266 Thread 100031 (PID=0: kernel/firmware taskq) 
0xffffffff80b25edd in sched_switch ()
  265 Thread 100043 (PID=0: kernel/mca taskq)           0xffffffff80b25edd in sched_switch ()
  264 Thread 100044 (PID=0: kernel/system_taskq_0)      0xffffffff80b25edd in sched_switch ()
  [... threads 263 through 6 elided: all of them (system_taskq_1-3, dbu_evict,
   z_vdev_file_0-15, acpi_task_0-2, CAM taskq, the zio_null/zio_read/zio_write/
   zio_free/zio_claim/zio_ioctl issue and intr worker pools, dp_sync_taskq_0-2
   and the first three dp_zil_clean_taskq_ threads) report the identical frame,
   0xffffffff80b25edd in sched_switch () ...]
    5 Thread 100332 (PID=0: kernel/dp_zil_clean_taskq_)
0xffffffff80b25edd in sched_switch ()
    4 Thread 100333 (PID=0: kernel/zfs_vn_rele_taskq)    0xffffffff80b25edd in sched_switch ()
    3 Thread 100334 (PID=0: kernel/metaslab_group_task)  0xffffffff80b25edd in sched_switch ()
    2 Thread 100335 (PID=0: kernel/metaslab_group_task)  0xffffffff80b25edd in sched_switch ()
    1 Thread 100368 (PID=0: kernel/xn0 txq 0)            0xffffffff80b25edd in sched_switch ()

Here's a different stack trace I found:

$ sudo procstat -kk 92301
  PID    TID COMM TDNAME KSTACK
92301 100454 sync -      mi_switch+0xe6 critical_exit+0x6d xen_intr_handle_upcall+0x1b7 Xxen_intr_upcall+0xa5 vputx+0x2ea zfs_zget+0x1aa zfs_get_data+0x3c zil_commit_impl+0xa9a zfs_sync+0xa6 sys_sync+0x116 amd64_syscall+0xa38 fast_syscall_common+0x101

This sync process hung (I'm running find, sync and other stuff in a loop to trigger the issue faster). Later I couldn't run procstat on the still-existing PID 92301 anymore: `procstat` hangs after printing the column headers, and my SSH connection drops, i.e. the instance has crashed.

(In reply to Allan Jude from comment #21)

Based on Andreas' documentation (https://github.com/AndiDog/zfs-lockup) I could reproduce the issue both on AWS and in a bhyve VM running 11.2-p4 (on top of 12-BETA2). In both environments, Allan's patch fixed the problem for me (i.e. I can't reproduce the issue anymore when running a patched kernel).

Please discard all my statements from comment 27 onwards. I made a blatant mistake while installing the patched kernel on our live CI host (where the issue appeared originally), so my repeated findings were with an old kernel. After testing again with the patched kernel, I could *not* reproduce the issue anymore. I will keep two instances running with the new kernel for a few days, one of them with heavy filesystem activity in background loops; that will show whether the patch is complete and stable. Sorry for the confusion!
As a side note, not directly related to this PR and its fix: while load testing to provoke the problem, I ran into the issue described in bug #209759 and bug #219355 (geli-encrypted swap deadlocks under memory pressure, resulting in "swap_pager: indefinite wait buffer" messages and essentially hanging the instance).

(In reply to Michael Gmelin from comment #35)

The patch I committed only stops the deadlock in the case where the vnode being reclaimed belongs to a deleted file. It can still happen in other circumstances. I am wondering if the correct approach is just to limit the number of times it can loop, and return an error. I am not sure what side effects returning an error would have.

(In reply to Allan Jude from comment #38)

> The patch I committed only stops the deadlock in the case where the vnode being reclaimed belongs to a deleted file. It can still happen in other circumstances.

Obviously (as avg pointed out), but the main issue in this test case was `find` iterating over files that had been deleted. Beforehand it would hang 9 out of 10 times when running the test case, while I couldn't reproduce it after patching the kernel, so the patch is effective for this specific scenario. Let's see whether Andreas' CI runs into issues again (as he pointed out, his observations of the last two weeks were accidentally made with an unpatched kernel).

> I am wondering if the correct approach is just to limit the number of times it can loop, and return an error. I am not sure what side effects returning an error will have.

My gut feeling says this sounds dangerous, thinking of scenarios where one process could cause another process's important write to fail where failure isn't expected. What kind of error would you return in such a scenario, maybe EINTR?

(In reply to Allan Jude from comment #38)

Andreas managed to reproduce the issue with the patch (as predicted by avg and you), but it took days of hammering the system heavily.
@Allan/avg: Do you have an overview of all the possible code paths in which this deadlock can happen? I haven't looked closely at the code myself yet. If it's just about reclaiming a vnode, we could allow that to fail and either retry the operation or temporarily allow exceeding kern.maxvnodes (a kind of soft-fail scenario).

(In reply to Michael Gmelin from comment #40)

Here are the things I tried in order to capture the latest occurrence I found. As Michael noted, this happened with the patched kernel (on 11.2), after days of stomping on the system with I/O-heavy infinite background loops.

Interesting fact: `top` no longer updates once per second, but can still be refreshed with the space key as usual. Also, my `screen` processes seem to be gone, the ones I used to run the background loops in.

Please note that we now have two processes in zil_commit_impl, and possibly those two are interlocked?! I'm keeping the system in this state for now, just in case someone has an idea which debug information we should collect apart from the below.
-----------------

root@asommer-devpm-ci:/usr/home/asommer # top
last pid: 58504;  load averages: 2.42, 3.37, 3.19  up 3+10:22:17  02:49:54
61 processes: 1 running, 59 sleeping, 1 zombie
CPU:  % user,  % nice,  % system,  % interrupt,  % idle
Mem: 69M Active, 1650M Inact, 10G Wired, 3544M Free
ARC: 4243M Total, 1624M MFU, 1829M MRU, 2539K Anon, 111M Header, 775M Other
     2517M Compressed, 4824M Uncompressed, 1.92:1 Ratio
Swap: 1024M Total, 1024M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 1326 1001        3  44    0   177M   169M select  3 197:02  53.96% python3.6   # <------------
 1777 1001        1  52    0 38216K 33724K select  0   1:09   0.00% python3.6
[... some 50 idle entries (cron, sendmail, sshd, getty, svscan, screen, ...)
 elided; the remaining notable entries are these two ...]
58378 asommer     1  21    0     0K    16K vmo_de  2   0:00   0.00% sshd
58381 asommer     1  20    0  4232K  1928K zcw->z  2   0:00   0.00% sync        # <------------

root@asommer-devpm-ci:/usr/home/asommer # procstat -kk 1326
  PID    TID COMM      TDNAME KSTACK
 1326 100422 python3.6 -      mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_timedwait_sig+0x14 _cv_timedwait_sig_sbt+0x18c seltdwait+0x7b kern_poll+0x3fd sys_poll+0x50 amd64_syscall+0xa38 fast_syscall_common+0x101
 1326 100433 python3.6 -      mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_sem2_wait+0x68a __umtx_op_sem2_wait+0x4b amd64_syscall+0xa38 fast_syscall_common+0x101
 1326 100436 python3.6 -      mi_switch+0xe6 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x18c zil_commit_impl+0xcd3 zfs_freebsd_putpages+0x635 VOP_PUTPAGES_APV+0x82 vnode_pager_putpages+0x8e vm_pageout_flush+0xea vm_object_page_collect_flush+0x213 vm_object_page_clean+0x146 vm_object_terminate+0x93 zfs_freebsd_reclaim+0x1e VOP_RECLAIM_APV+0x82 vgonel+0x208 vrecycle+0x4a zfs_freebsd_inactive+0xd VOP_INACTIVE_APV+0x82 vinactive+0xfc

root@asommer-devpm-ci:/usr/home/asommer # procstat 58381
  PID  PPID  PGID   SID  TSID THR LOGIN   WCHAN    EMUL          COMM
58381     1 58381 58379 58379   1 asommer zcw->zcw FreeBSD ELF64 sync

root@asommer-devpm-ci:/usr/home/asommer # procstat -f 58381
  PID COMM   FD T V FLAGS    REF OFFSET PRO NAME
58381 sync text v r r-------   -      -   - /bin/sync
58381 sync  cwd v d r-------   -      -   - /usr/home/asommer
58381 sync root v d r-------   -      -   - /
58381 sync    0 v x rw------   3     40   - -
58381 sync    1 v x rw------   3     40   - -
58381 sync    2 v x rw------   3     40   - -

root@asommer-devpm-ci:/usr/home/asommer # procstat -r 58381
  PID COMM RESOURCE                       VALUE
58381 sync user time                      00:00:00.000000
58381 sync system time                    00:00:00.003656
58381 sync maximum RSS                    0.0 KB
58381 sync integral shared memory        0.0 KB
58381 sync integral unshared data        0.0 KB
58381 sync integral unshared stack       0.0 KB
58381 sync page reclaims                 89
58381 sync page faults                   0
58381 sync swaps                         0
58381 sync block reads                   3
58381 sync block writes                  0
58381 sync messages sent                 0
58381 sync messages received             0
58381 sync signals received              0
58381 sync voluntary context switches    6
58381 sync involuntary context switches  0

root@asommer-devpm-ci:/usr/home/asommer # procstat -t 58381
  PID    TID COMM TDNAME CPU PRI STATE WCHAN
58381 101315 sync -       -1 120 sleep zcw->zcw

root@asommer-devpm-ci:/usr/home/asommer # procstat -k 58381
  PID    TID COMM TDNAME KSTACK
58381 101315 sync -      mi_switch sleepq_timedwait _cv_timedwait_sbt zil_commit_impl zfs_sync sys_sync amd64_syscall fast_syscall_common

root@asommer-devpm-ci:/usr/home/asommer # procstat -kk 58381
  PID    TID COMM TDNAME KSTACK
58381 101315 sync -      mi_switch+0xe6 sleepq_timedwait+0x2f _cv_timedwait_sbt+0x18c zil_commit_impl+0xcd3 zfs_sync+0xa6 sys_sync+0x116 amd64_syscall+0xa38 fast_syscall_common+0x101

A commit references this bug:

Author: allanjude
Date: Tue Dec 11 19:34:25 UTC 2018
New revision: 341828
URL: https://svnweb.freebsd.org/changeset/base/341828

Log:
  MFC: r339289:
  Resolve a hang in ZFS during vnode reclaimation

  This is caused by a deadlock between zil_commit() and zfs_zget()

  Add a way for zfs_zget() to break out of the retry loop in the common case

  PR:           229614, 231117
  Reported by:  grembo, jhb, Andreas Sommer, others
  Relnotes:     yes
  Sponsored by: Klara Systems

Changes:
  _U  stable/11/
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c

I am just noting this here so I can find it later. A similar issue may have been seen over on ZFS-on-Linux, and was solved a little differently: https://github.com/behlendorf/zfs/commit/761d7acaeb2bddc58a98f013ef322b9d2e585cb4

(In reply to Allan Jude from comment #43)

I think it's a slightly different issue, though. It seems to be related to base r334810.
A commit references this bug:

Author: emaste
Date: Wed Dec 19 18:05:50 UTC 2018
New revision: 342226
URL: https://svnweb.freebsd.org/changeset/base/342226

Log:
  MFS11 r341828:
  Resolve a hang in ZFS during vnode reclaimation

  This is caused by a deadlock between zil_commit() and zfs_zget()
  Add a way for zfs_zget() to break out of the retry loop in the common case

  PR:           229614, 231117
  Submitted by: allanjude
  Approved by:  so
  Security:     FreeBSD-EN-18:18.zfs
  Sponsored by: Klara Systems, The FreeBSD Foundation

Changes:
  _U  releng/11.2/
  releng/11.2/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c

Coming here by way of https://github.com/zfsonlinux/zfs/issues/7425

Re ZoL comment #43, the unlinked check was removed in https://github.com/zfsonlinux/zfs/commit/987014903f9d36783547188b6ad00f01d9a076bd

The check apparently caused some issues with xattr functionality. I'm not sure whether that applies to FreeBSD, but it looks like this loop may still be happening in ZoL. Should we just try limiting the number of retries in there?

^Triage: Track MFC to stable branch. This ended up in releng/11.2

@ Allan / Ed: Is this now resolved?

(In reply to Kubilay Kocak from comment #47)
Koobs: the initial patch was committed, MFCd, and an EN was issued. However, it seems to have made the issue less frequent, rather than resolving it entirely. I have not had the time recently to develop a better patch.

Do any others subscribed to this bug still face this issue? Is there an updated reproduction case?

(In reply to Allan Jude from comment #48)
I never had this happen to me again since the patch on my heavily used poudriere box, which could trigger the issue on a regular basis. However, it's just one host, so not much of a statistic.

There was just another upstream commit to this area of the code. Is anyone still having the issue? If so, does this change help?
https://github.com/zfsonlinux/zfs/commit/0c46813805f428038f0f5e2541cdab8034c2f082

(In reply to Allan Jude from comment #50)
All good here so far, with heavy poudriere usage that used to trigger the issue quite fast.

It's hard to tell if there are any lingering issues here. Andreas reported that the issue was still present on a patched kernel, but then in comment 36 said that an error resulted in testing of the wrong kernel. Is there anything left to investigate here?

We see the following backtraces:

Tracing command chrome pid 6985 tid 101536 td 0xfffff8004358e000
sched_switch() at sched_switch+0x8ad/frame 0xfffffe00325c6570
mi_switch() at mi_switch+0xe1/frame 0xfffffe00325c65a0
sleepq_wait() at sleepq_wait+0x2c/frame 0xfffffe00325c65d0
_sleep() at _sleep+0x237/frame 0xfffffe00325c6650
vm_object_pip_wait() at vm_object_pip_wait+0x47/frame 0xfffffe00325c6680
bufobj_invalbuf() at bufobj_invalbuf+0x2f1/frame 0xfffffe00325c66f0
vm_object_terminate() at vm_object_terminate+0xf5/frame 0xfffffe00325c6740
zfs_freebsd_reclaim() at zfs_freebsd_reclaim+0x1e/frame 0xfffffe00325c6770
VOP_RECLAIM_APV() at VOP_RECLAIM_APV+0x7e/frame 0xfffffe00325c67a0
vgonel() at vgonel+0x202/frame 0xfffffe00325c6820
vrecycle() at vrecycle+0x4a/frame 0xfffffe00325c6850
zfs_freebsd_inactive() at zfs_freebsd_inactive+0xd/frame 0xfffffe00325c6860
VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0x7e/frame 0xfffffe00325c6890
vinactive() at vinactive+0xff/frame 0xfffffe00325c68f0
vputx() at vputx+0x2b7/frame 0xfffffe00325c6950
vm_object_deallocate() at vm_object_deallocate+0x69c/frame 0xfffffe00325c69b0
_vm_map_unlock() at _vm_map_unlock+0xd9/frame 0xfffffe00325c69d0
vm_map_remove() at vm_map_remove+0xb2/frame 0xfffffe00325c6a00
vmspace_exit() at vmspace_exit+0xcd/frame 0xfffffe00325c6a40
exit1() at exit1+0x599/frame 0xfffffe00325c6ab0
sys_sys_exit() at sys_sys_exit+0xd/frame 0xfffffe00325c6ac0
amd64_syscall() at amd64_syscall+0x34b/frame 0xfffffe00325c6bf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe00325c6bf0
--- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x80d2e5e5a, rsp = 0x7fffdfbfb058, rbp = 0x7fffdfbfb060 ---

Tracing command chrome pid 21874 tid 101281 td 0xfffff8002bbfe000 (CPU 1)
_sx_xlock_hard() at _sx_xlock_hard+0x19a/frame 0xfffffe00324f41e0
rrw_enter_read_impl() at rrw_enter_read_impl+0x9b/frame 0xfffffe00324f4230
zfs_freebsd_putpages() at zfs_freebsd_putpages+0x50/frame 0xfffffe00324f4370
VOP_PUTPAGES_APV() at VOP_PUTPAGES_APV+0x7e/frame 0xfffffe00324f43a0
vnode_pager_putpages() at vnode_pager_putpages+0x89/frame 0xfffffe00324f4410
vm_pageout_flush() at vm_pageout_flush+0xed/frame 0xfffffe00324f4510
vm_object_page_collect_flush() at vm_object_page_collect_flush+0x213/frame 0xfffffe00324f4680
vm_object_page_clean() at vm_object_page_clean+0x156/frame 0xfffffe00324f46f0
vm_object_terminate() at vm_object_terminate+0xb0/frame 0xfffffe00324f4740
zfs_freebsd_reclaim() at zfs_freebsd_reclaim+0x1e/frame 0xfffffe00324f4770
VOP_RECLAIM_APV() at VOP_RECLAIM_APV+0x7e/frame 0xfffffe00324f47a0
vgonel() at vgonel+0x202/frame 0xfffffe00324f4820
vrecycle() at vrecycle+0x4a/frame 0xfffffe00324f4850
zfs_freebsd_inactive() at zfs_freebsd_inactive+0xd/frame 0xfffffe00324f4860
VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0x7e/frame 0xfffffe00324f4890
vinactive() at vinactive+0xff/frame 0xfffffe00324f48f0
vputx() at vputx+0x2b7/frame 0xfffffe00324f4950
vm_object_deallocate() at vm_object_deallocate+0x69c/frame 0xfffffe00324f49b0
_vm_map_unlock() at _vm_map_unlock+0xd9/frame 0xfffffe00324f49d0
vm_map_remove() at vm_map_remove+0xb2/frame 0xfffffe00324f4a00
vmspace_exit() at vmspace_exit+0xcd/frame 0xfffffe00324f4a40
exit1() at exit1+0x599/frame 0xfffffe00324f4ab0
sys_sys_exit() at sys_sys_exit+0xd/frame 0xfffffe00324f4ac0
amd64_syscall() at amd64_syscall+0x34b/frame 0xfffffe00324f4bf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe00324f4bf0
--- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x80d2e5e5a, rsp = 0x7fffdfbfb058, rbp = 0x7fffdfbfb060 ---

(In reply to Pawel Jakub Dawidek from comment #53)
This looks like a different, new issue. Why pile it here? (53 comments thus far) Why not open a new bug report with full details?

(In reply to Andriy Gapon from comment #54)
Indeed, it looks like a different bug. It is also not clear which version of FreeBSD is affected. The presence of vm_object_pip_wait() in one of the stacks makes it look like this is HEAD.

The original issue appears to be solved; if you have a similar issue, please open a new bug.