zfs operations such as rollback or destroy deadlock FreeBSD. Subsequent commands such as `mount -p` or `bectl list` also hang. Writing data as files still works. dmesg, syslog etc. are all empty.

## environment

Tried under "default" 13.0-RELEASE-p3 zfs, and also under "kmod" zfs-2.1.99-1 zfs-kmod-v2021073000-zfs_7eebcd2be.

I will build CURRENT with debug, re-try this, and report back.

## pools

- embiggen/koans dataset (zpool of 4-drive striped mirrors)
- envy (nvme zpool)

```
  pool: embiggen
 state: ONLINE
  scan: scrub repaired 0B in 02:42:41 with 0 errors on Thu Aug 26 18:22:44 2021
config:

	NAME          STATE     READ WRITE CKSUM
	embiggen      ONLINE       0     0     0
	  mirror-0    ONLINE       0     0     0
	    gpt/zfs0  ONLINE       0     0     0
	    gpt/zfs1  ONLINE       0     0     0
	  mirror-1    ONLINE       0     0     0
	    gpt/zfs2  ONLINE       0     0     0
	    gpt/zfs3  ONLINE       0     0     0

errors: No known data errors

  pool: envy
 state: ONLINE
  scan: scrub repaired 0B in 00:07:20 with 0 errors on Thu Aug 26 15:47:42 2021
config:

	NAME        STATE     READ WRITE CKSUM
	envy        ONLINE       0     0     0
	  gpt/envy  ONLINE       0     0     0

errors: No known data errors
```

## default 13.0-RELEASE zfs

- the parent process is not a zombie process
- not writing a coredump
- can't be attached to from lldb etc.
- won't respond to `kill -CONT ...`

```
$ top -SjwbHzPp 47255
last pid: 20225;  load averages: 0.78, 0.81, 0.70  up 2+02:11:14  14:14:38
3744 threads:  10 running, 3693 sleeping, 6 stopped, 35 waiting
CPU 0:  2.6% user,  0.8% nice,  1.0% system,  0.0% interrupt, 95.6% idle
CPU 1:  3.0% user,  0.9% nice,  1.1% system,  0.6% interrupt, 94.3% idle
CPU 2:  3.2% user,  1.0% nice,  1.2% system,  0.0% interrupt, 94.6% idle
CPU 3:  3.3% user,  1.0% nice,  1.2% system,  0.0% interrupt, 94.6% idle
CPU 4:  3.4% user,  1.0% nice,  1.2% system,  0.1% interrupt, 94.4% idle
CPU 5:  3.5% user,  1.0% nice,  1.2% system,  0.1% interrupt, 94.3% idle
CPU 6:  3.5% user,  1.0% nice,  1.3% system,  1.1% interrupt, 93.0% idle
CPU 7:  3.4% user,  1.0% nice,  1.2% system,  0.0% interrupt, 94.3% idle
Mem: 2489M Active, 20G Inact, 472M Laundry, 66G Wired, 8074K Buf, 35G Free
ARC: 51G Total, 34G MFU, 9441M MRU, 15M Anon, 518M Header, 7401M Other
     36G Compressed, 69G Uncompressed, 1.93:1 Ratio
Swap: 252G Total, 252G Free

  PID  JID USERNAME  PRI NICE   SIZE    RES   SWAP STATE   C   TIME    WCPU COMMAND
47255    0 dch        20    0  1316M   111M     0B STOP    4   0:00   0.00% beam.smp{10_dirty_io_sch}
47255    0 dch        20    0  1316M   111M     0B STOP    4   0:00   0.00% beam.smp{sys_sig_dispatc}

$ ps augx -p 47255
USER    PID %CPU %MEM     VSZ    RSS TT  STAT STARTED     TIME COMMAND
dch   47255  0.0  0.1 1347516 113316  1  TX   13:57    0:03.48 /usr/local/lib/erlang24/erts-12.0.3/bin/beam.smp -- -root /usr/local/lib/erlang24 -progname erl

$ sudo procstat -kk 47255
  PID    TID COMM     TDNAME          KSTACK
47255 574625 beam.smp sys_sig_dispatc mi_switch+0xc1 thread_suspend_switch+0xc0 thread_single+0x69c exit1+0xc1 sys_sys_exit+0xd amd64_syscall+0x10c fast_syscall_common+0xf8
47255 574653 beam.smp 10_dirty_io_sch mi_switch+0xc1 _sleep+0x1cb rms_rlock_fallback+0x90 zfs_lookup+0x7e zfs_freebsd_cachedlookup+0x6b vfs_cache_lookup+0xad lookup+0x68c namei+0x487 kern_statat+0xcf sys_fstatat+0x2f amd64_syscall+0x10c fast_syscall_common+0xf8
```

This situation repeats after reboot, and each time shows a hanging zfs rollback or similar command:

```
/sbin/zfs rollback -r embiggen/...@pristine

18990 168480 zfs - mi_switch+0xc1 _vm_page_busy_sleep+0x100 vm_page_sleep_if_busy+0x28 vm_object_page_remove+0xdf vn_pages_remove+0x4c zfs_rezget+0x35 zfs_resume_fs+0x258 zfs_ioc_rollback+0x158 zfsdev_ioctl_common+0x4e3 zfsdev_ioctl+0x143 devfs_ioctl+0xc7 vn_ioctl+0x1a4 devfs_ioctl_f+0x1e kern_ioctl+0x26d sys_ioctl+0xf6 amd64_syscall+0x10c fast_syscall_common+0xf8
```
## zfs-kmod-v2021073000-zfs_7eebcd2be

```
sudo zfs destroy -vrRf 'embiggen/koans'
will destroy embiggen/koans/cache/koan-ci/6253a9f1f8df5d292b0c675fafde7351c89d4e76587b8de7d7b76e699acdccf6
load: 0.29  cmd: sudo 95533 [select] 1104.21r 0.00u 0.01s 0% 8120k
mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _cv_wait_sig+0xe4 seltdwait+0xa9 kern_poll+0x51b sys_ppoll+0x6f amd64_syscall+0x10c fast_syscall_common+0xf8
^C
load: 0.44  cmd: sudo 95533 [select] 2002.06r 0.00u 0.01s 0% 8120k
mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _cv_wait_sig+0xe4 seltdwait+0xa9 kern_poll+0x51b sys_ppoll+0x6f amd64_syscall+0x10c fast_syscall_common+0xf8
load: 0.48  cmd: sudo 95533 [select] 3931.90r 0.00u 0.01s 0% 8120k
mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _cv_wait_sig+0xe4 seltdwait+0xa9 kern_poll+0x51b sys_ppoll+0x6f amd64_syscall+0x10c fast_syscall_common+0xf8
```

## subsequent commands hang

```
# bectl list
^C
load: 0.46  cmd: bectl 97606 [zfs teardown] 1221.12r 0.00u 0.01s 0% 8056k
mi_switch+0xc1 _sleep+0x1cb rms_rlock_fallback+0x90 zfs_statfs+0x32 kern_getfsstat+0x202 sys_getfsstat+0x22 amd64_syscall+0x10c fast_syscall_common+0xf8
```

## zpool events -vf

From the last "sync event" prior to the hanging destroy:

```
Sep 2 2021 08:39:43.644450230 sysevent.fs.zfs.config_sync
        version = 0x0
        class = "sysevent.fs.zfs.config_sync"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        time = 0x61308dcf 0x266987b6
        eid = 0x9

Sep 2 2021 08:42:38.680064715 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/0e6074eb51ffe7c5d478cb154327221d0355248e/%rollback"
        history_internal_str = "parent=0e6074eb51ffe7c5d478cb154327221d0355248e"
        history_internal_name = "clone swap"
        history_dsid = 0x1ea
        history_txg = 0x1019819
        history_time = 0x61308e7e
        time = 0x61308e7e 0x2888f6cb
        eid = 0xa

Sep 2 2021 08:42:38.698068059 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/0e6074eb51ffe7c5d478cb154327221d0355248e/%rollback"
        history_internal_str = "(bptree, mintxg=16881555)"
        history_internal_name = "destroy"
        history_dsid = 0x1ea
        history_txg = 0x1019819
        history_time = 0x61308e7e
        time = 0x61308e7e 0x299bac5b
        eid = 0xb

Sep 2 2021 08:42:39.009063859 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/0e6074eb51ffe7c5d478cb154327221d0355248e/%rollback"
        history_internal_str = "parent=0e6074eb51ffe7c5d478cb154327221d0355248e"
        history_internal_name = "clone swap"
        history_dsid = 0x1f8
        history_txg = 0x101981e
        history_time = 0x61308e7f
        time = 0x61308e7f 0x8a4db3
        eid = 0xc

Sep 2 2021 08:42:39.011064352 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/0e6074eb51ffe7c5d478cb154327221d0355248e/%rollback"
        history_internal_str = "(bptree, mintxg=16881555)"
        history_internal_name = "destroy"
        history_dsid = 0x1f8
        history_txg = 0x101981e
        history_time = 0x61308e7f
        time = 0x61308e7f 0xa8d420
        eid = 0xd

Sep 2 2021 08:42:41.355063329 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/cache/h2o/338ff637ff357ba38882e44d01251c511523ab67f2be5827b25d9de423544eb3@c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8"
        history_internal_str = " "
        history_internal_name = "snapshot"
        history_dsid = 0x3b4
        history_txg = 0x101981f
        history_time = 0x61308e81
        time = 0x61308e81 0x1529d621
        eid = 0xe

Sep 2 2021 08:42:41.496063945 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8"
        history_internal_str = "origin=embiggen/koans/templates/12.2-RELEASE@pristine (35650)"
        history_internal_name = "clone"
        history_dsid = 0x1ea
        history_txg = 0x1019820
        history_time = 0x61308e81
        time = 0x61308e81 0x1d9155c9
        eid = 0xf

Sep 2 2021 08:42:41.648063789 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8@pristine"
        history_internal_str = " "
        history_internal_name = "snapshot"
        history_dsid = 0x328
        history_txg = 0x1019821
        history_time = 0x61308e81
        time = 0x61308e81 0x26a0ab2d
        eid = 0x10

Sep 2 2021 08:42:41.923063860 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "parent=c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8"
        history_internal_name = "clone swap"
        history_dsid = 0x27d
        history_txg = 0x1019826
        history_time = 0x61308e81
        time = 0x61308e81 0x3704d634
        eid = 0x11

Sep 2 2021 08:42:41.926063583 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "(bptree, mintxg=16881697)"
        history_internal_name = "destroy"
        history_dsid = 0x27d
        history_txg = 0x1019826
        history_time = 0x61308e81
        time = 0x61308e81 0x37329bdf
        eid = 0x12

Sep 2 2021 08:42:44.748063744 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "parent=c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8"
        history_internal_name = "clone swap"
        history_dsid = 0x3bb
        history_txg = 0x101982b
        history_time = 0x61308e84
        time = 0x61308e84 0x2c968c00
        eid = 0x13

Sep 2 2021 08:42:44.751063792 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "(bptree, mintxg=16881697)"
        history_internal_name = "destroy"
        history_dsid = 0x3bb
        history_txg = 0x101982b
        history_time = 0x61308e84
        time = 0x61308e84 0x2cc452f0
        eid = 0x14

Sep 2 2021 08:42:44.968063430 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "parent=c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8"
        history_internal_name = "clone swap"
        history_dsid = 0x71
        history_txg = 0x1019830
        history_time = 0x61308e84
        time = 0x61308e84 0x39b379c6
        eid = 0x15

Sep 2 2021 08:42:44.970063949 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "(bptree, mintxg=16881697)"
        history_internal_name = "destroy"
        history_dsid = 0x71
        history_txg = 0x1019830
        history_time = 0x61308e84
        time = 0x61308e84 0x39d2004d
        eid = 0x16
```

and then nothing for 30 minutes.
In a simpler context I do not see the problem for destroy (so: compare/contrast with your context).

Context:

```
# zpool status
  pool: zopt0
 state: ONLINE
  scan: scrub repaired 0B in 00:01:35 with 0 errors on Sat Jun 19 18:23:07 2021
config:

	NAME        STATE     READ WRITE CKSUM
	zopt0       ONLINE       0     0     0
	  nda0p3    ONLINE       0     0     0

errors: No known data errors

# uname -apKU
FreeBSD CA72_16Gp_ZFS 13.0-RELEASE-p4 FreeBSD 13.0-RELEASE-p4 #4 releng/13.0-n244760-940681634ee1-dirty: Mon Aug 30 11:35:45 PDT 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/13_0R-CA72-nodbg-clang/usr/13_0R-src/arm64.aarch64/sys/GENERIC-NODBG-CA72  arm64 aarch64 1300139 1300139
```

Destroy related activity (of prior cross-build materials):

```
# zfs list | grep amd64
zopt0/BUILDs/13S-amd64-nodbg-clang      7.72G   167G  7.72G  /usr/obj/BUILDs/13S-amd64-nodbg-clang
zopt0/BUILDs/13_0R-amd64-nodbg-clang    7.81G   167G  7.81G  /usr/obj/BUILDs/13_0R-amd64-nodbg-clang
zopt0/BUILDs/main-amd64-nodbg-clang     7.98G   167G  7.98G  /usr/obj/BUILDs/main-amd64-nodbg-clang
zopt0/DESTDIRs/13_0R-amd64-poud           96K   167G    96K  /usr/obj/DESTDIRs/13_0R-amd64-poud
zopt0/DESTDIRs/main-amd64-poud            96K   167G    96K  /usr/obj/DESTDIRs/main-amd64-poud

# zfs destroy -vrRf zopt0/DESTDIRs/main-amd64-poud
will destroy zopt0/DESTDIRs/main-amd64-poud

# zfs destroy -vrRf zopt0/DESTDIRs/13_0R-amd64-poud
will destroy zopt0/DESTDIRs/13_0R-amd64-poud

# zfs destroy -vrRf zopt0/BUILDs/main-amd64-nodbg-clang
will destroy zopt0/BUILDs/main-amd64-nodbg-clang

# zfs destroy -vrRf zopt0/BUILDs/13_0R-amd64-nodbg-clang
will destroy zopt0/BUILDs/13_0R-amd64-nodbg-clang

# zfs destroy -vrRf zopt0/BUILDs/13S-amd64-nodbg-clang
will destroy zopt0/BUILDs/13S-amd64-nodbg-clang

# zfs list | grep amd64
#
```

(I added the blank lines to make reading easier.)
Please attach full procstat -kk -a output.
Further investigation suggests this is related to devfs mounts as well, i.e. it may not be directly zfs-related, or involve zfs at all. I can trigger it by mounting ~90+ devfs instances on cloned zfs filesystems and then trying just one more clone: *boom*. I am almost ready with a simplified test case.
Created attachment 227959 [details] procstat-kk-a.log
(In reply to Dave Cottlehuber from comment #5)

So there is effectively an LOR here:

```
   56 152682 zfs              -   mi_switch+0xc1 _vm_page_busy_sleep+0x100 vm_page_sleep_if_busy+0x28 vm_object_page_remove+0xdf vn_pages_remove+0x4c zfs_rezget+0x35 zfs_resume_fs+0x258 zfs_ioc_rollback+0x158 zfsdev_ioctl_common+0x4e3 zfsdev_ioctl+0x143 devfs_ioctl+0xc7 vn_ioctl+0x1a4 devfs_ioctl_f+0x1e kern_ioctl+0x26d sys_ioctl+0xf6 amd64_syscall+0x10c fast_syscall_common+0xf8

  445 168579 ping             -   mi_switch+0xc1 _sleep+0x1cb rms_rlock_fallback+0x90 zfs_freebsd_getpages+0x52 vnode_pager_getpages+0x41 vm_pager_get_pages+0x22 vm_fault+0x49a vm_fault_trap+0x6d trap_pfault+0x1f8 trap+0x3fd calltrap+0x8
```

The page array passed to zfs_getpages() is busied upon entry, and then we do ZFS_ENTER(zfsvfs). But zfs_rezget() is called with this lock held and then tries to busy vnode pages.
I am not sure how best to fix this.

To elaborate a bit more: the deadlock occurs because a rollback does a suspend/resume of the target fs. This involves taking the teardown write lock; one thing we do with the lock held is call zfs_rezget() on all vnodes associated with the filesystem, which among other things throws away all data cached in the page cache. This requires pages to be busied with the ZFS teardown write lock held, so I am inclined to think that zfs_freebsd_getpages() should be responsible for breaking the deadlock, as it does in https://cgit.freebsd.org/src/commit/?id=cd32b4f5b79c97b293f7be3fe9ddfc9024f7d734 .

zfs_freebsd_getpages() could perhaps trylock and upon failure return some EAGAIN-like status to ask the fault handler to retry, but I don't see a way to do that: vm_fault_getpages() squashes the error and does not allow the pager to return KERN_RESOURCE_SHORTAGE.

Alternately, zfs_freebsd_getpages() could perhaps wire and unbusy the page upon a trylock failure. Once it successfully acquires the teardown read lock, it could re-look-up the fault page and compare, or re-insert the wired page if necessary.

OTOH I cannot see how this is handled on Linux. In particular, I do not see how their zfs_rezget() invalidates the page cache.
Another potential solution -- and I have/had a prototype for it (for the older FreeBSD ZFS) -- is to have custom lock/unlock vops in ZFS and to enter the teardown lock before acquiring the vnode lock. This establishes a lock order friendly to zfs_rezget(). There were some tricky details to get that approach working, but I think that I got it working in the end. The new order between the teardown and the vnode locks also allowed simplifying some code in other vops where currently we have to drop and re-acquire the teardown lock each time we need to get the vnode lock.
avg@, can you share that prior work at least? It may be helpful.
(In reply to Andriy Gapon from comment #8)

Sorry, I missed this comment. I can try to help finish such a patch if it would be useful. I think hacking zfs_getpages() to replace pages when a trylock of the teardown lock fails also works, but your approach has other benefits.
Created attachment 228714 [details]
old ZFS VOP_LOCK work

Here is the best I can share immediately. This is from an old branch. I am not sure how soon I may have time to try to rebase it. But I hope you can find something useful in it.
And the commit message:

```
Author:     Andriy Gapon <avg@icyb.net.ua>
AuthorDate: Fri Nov 16 12:11:45 2012 +0200

    zfs: introduce zfs_freebsd_lock and zfs_freebsd_unlock...

    These vnode operations acquire and release z_teardown_lock in addition
    to the normal operations on the vnode lock.

    This is a half-step towards fixing lock order problems between VM page
    busying "locks" and ZFS internal locks. In particular, now
    z_teardown_lock is always acquired prior to page busying (the same as
    the vnode lock). Ideally I wanted z_teardown_lock to be acquired before
    the vnode lock, but this is not possible to implement using the
    VOP_LOCK1 approach.

    An additional effect of this change is that now all ZFS vnode ops are
    guaranteed to be protected with z_teardown_lock. Previously some
    FreeBSD-specific operations accessed znodes and zfsvfs without calling
    ZFS_ENTER first. Consequently, after this change the ZFS_ENTER calls
    in the vnode ops have become redundant. They are still needed in ZFS
    vfs ops.

    This change should fix a deadlock between vn_pages_remove /
    vnode_pager_setsize called from zfs_rezget with z_teardown_lock held
    and a putpages operation blocked on z_teardown_lock while entering
    zfs_write.

    This change has also made z_teardown_inactive_lock redundant.

    The only remaining LOR between the page busying and ZFS is with ZFS
    range locks. This should be fixed by introducing VOP_RANGELOCK to
    FreeBSD VFS, placing calls to this operation at strategic places and
    implementing the op for ZFS using the ZFS rangelock.
```
Created attachment 228715 [details] old ZFS VOP_LOCK work
(In reply to Andriy Gapon from comment #12)

> The only remaining LOR between the page busying and ZFS is with ZFS range
> locks. This should be fixed by introducing VOP_RANGELOCK to FreeBSD VFS,
> placing calls to this operation at strategic places and implementing the
> op for ZFS using the ZFS rangelock.

Is this the bug fixed by https://cgit.freebsd.org/src/commit/?id=66b415fb8f9a7bd9f01d063ecebc2c74cfcb10be ?
(In reply to Mark Johnston from comment #14)

I regret not putting more detailed descriptions of several things in that commit message because now, unfortunately, I cannot recall those details. Your fix very much sounds like it could be it, but I cannot tell for sure whether I had the same thing in mind almost a decade ago.
Created attachment 228813 [details]
first attempt

Dave, would you be able to try to reproduce the deadlock with this patch applied? It should apply to and work with any of releng/13.0, stable/13 or main, though I only tested main.
Created attachment 228815 [details]
second attempt

Fix a lock leak via VOP_RECLAIM.
Some stress testing revealed a couple of problems with the original approach of overriding vop_lock:

1. rms read locks don't permit recursion. This isn't really obvious, since nothing asserts it; rms locks would need WITNESS support for the recursion to be caught.

2. Even if reader recursion were permitted, deadlocks can occur when threads lock multiple vnodes (as can happen during lookup). In particular, thread A may be blocked on reader thread B while trying to acquire the teardown write lock, and thread B may be blocked on a vnode lock owned by thread C, which is in turn blocked from acquiring the teardown read lock by thread A.

So we'd need some deadlock avoidance mechanism there, or a different approach.
(In reply to Mark Johnston from comment #18)

Now that you found this I also belatedly recalled that I ran into the same kind of problem... I think that getting the teardown lock before the vnode locks would help with a problem like that. Unfortunately, that's not possible (?) to arrange via VOP_LOCK because of the interlock, which is not sleepable. I was leaning towards the idea that ZFS should somehow hook the teardown lock into vn_start_write() or something like that. But for ZFS we would also need vn_start_read() as well. And it's not easy to sprinkle such calls in all places where they are needed...

Just in case, here is the 3-way deadlock that I saw:

```
(kgdb) tid 102522
(kgdb) bt
#0  sched_switch (td=0xfffff802abd53580, newtd=0xfffff80008d16580, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2005
#1  0xffffffff806a5091 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:438
#2  0xffffffff806e252c in sleepq_switch (wchan=0xfffffe000ec025c8, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:611
#3  0xffffffff806e23d2 in sleepq_wait (wchan=0xfffffe000ec025c8, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:690
#4  0xffffffff8063ffa3 in _cv_wait (cvp=0xfffffe000ec025c8, lock=<optimized out>) at /usr/src/sys/kern/kern_condvar.c:144
#5  0xffffffff8035bc6b in rrw_enter_read_impl (rrl=0xfffffe000ec025a8, prio=0, tag=0xfffff8065044cb10, try=0) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/rrwlock.c:201
#6  0xffffffff8035bbb3 in rrw_enter_read (rrl=<unavailable>, tag=<unavailable>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/rrwlock.c:223
#7  0xffffffff8035c2ff in rrm_enter_read (rrl=<optimized out>, tag=<unavailable>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/rrwlock.c:429
#8  0xffffffff8035c2b1 in rrm_enter (rrl=<unavailable>, rw=<optimized out>, tag=<unavailable>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/rrwlock.c:405
#9  0xffffffff803b81db in zfs_freebsd_lock (ap=0xfffffe090e81d4f8) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:6010
#10 0xffffffff8095608d in VOP_LOCK1_APV (vop=<optimized out>, a=0xfffffe090e81d4f8) at vnode_if.c:2087
#11 0xffffffff8075f8ab in VOP_LOCK1 (vp=<optimized out>, flags=<optimized out>, file=<unavailable>, line=<unavailable>) at ./vnode_if.h:859
#12 0xffffffff8075e755 in _vn_lock (vp=<optimized out>, flags=524544, file=0xffffffff80acfa01 "/usr/src/sys/kern/vfs_subr.c", line=2768) at /usr/src/sys/kern/vfs_vnops.c:1555
#13 0xffffffff8074ef5f in vputx (vp=0xfffff8065044cb10, func=1) at /usr/src/sys/kern/vfs_subr.c:2768
#14 0xffffffff8074edbe in vrele (vp=<unavailable>) at /usr/src/sys/kern/vfs_subr.c:2804
#15 0xffffffff8073a901 in vn_vptocnp (vp=0xfffffe090e81d628, cred=0xfffff800234fe000, buf=<optimized out>, buflen=<optimized out>) at /usr/src/sys/kern/vfs_cache.c:2232
#16 0xffffffff8073a384 in vn_fullpath1 (td=0xfffff802abd53580, vp=0xfffff806cd2063b0, rdir=0xfffff8001bd8bb10, buf=0xfffff801dec4f800 "\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336"..., retbuf=0xfffffe090e81d770, buflen=1007) at /usr/src/sys/kern/vfs_cache.c:2335
#17 0xffffffff8073a62f in vn_fullpath (td=0xfffff802abd53580, vn=0xfffff8060b24cce8, retbuf=0xfffffe090e81d770, freebuf=0xfffffe090e81d778) at /usr/src/sys/kern/vfs_cache.c:2164
#18 0xffffffff80760af4 in vn_fill_kinfo_vnode (vp=0xfffff8060b24cce8, kif=0xfffff80232b49018) at /usr/src/sys/kern/vfs_vnops.c:2353
#19 0xffffffff8064f27c in export_vnode_to_kinfo (vp=0xfffff8060b24cce8, fd=-1, fflags=1, kif=0xfffff80232b49018, flags=1) at /usr/src/sys/kern/kern_descrip.c:3408
#20 0xffffffff8064e9b4 in export_vnode_to_sb (vp=0xfffff8060b24cce8, fd=-1, fflags=1, efbuf=0xfffff80232b49000) at /usr/src/sys/kern/kern_descrip.c:3474
#21 0xffffffff8064e84c in kern_proc_filedesc_out (p=<optimized out>, sb=0xfffffe090e81d888, maxlen=-1, flags=1) at /usr/src/sys/kern/kern_descrip.c:3537
#22 0xffffffff8064f87d in sysctl_kern_proc_filedesc (oidp=<optimized out>, arg1=<optimized out>, arg2=<optimized out>, req=<optimized out>) at /usr/src/sys/kern/kern_descrip.c:3597
#23 0xffffffff806a7d6f in sysctl_root_handler_locked (oid=0xffffffff80d93760 <sysctl___kern_proc_filedesc>, arg1=0xfffffe090e81da9c, arg2=1, req=0xfffffe090e81d9d0, tracker=0xfffffe090e81d940) at /usr/src/sys/kern/kern_sysctl.c:165
#24 0xffffffff806a7545 in sysctl_root (oidp=<optimized out>, arg1=<unavailable>, arg2=1, req=<optimized out>) at /usr/src/sys/kern/kern_sysctl.c:2027
#25 0xffffffff806a7a38 in userland_sysctl (td=<optimized out>, name=0xfffffe090e81da90, namelen=4, old=0x0, oldlenp=<optimized out>, inkernel=<optimized out>, new=<optimized out>, newlen=<optimized out>, retval=<unavailable>, flags=0) at /usr/src/sys/kern/kern_sysctl.c:2122
#26 0xffffffff806a78bf in sys___sysctl (td=0xfffff802abd53580, uap=0xfffff802abd53930) at /usr/src/sys/kern/kern_sysctl.c:2057
#27 0xffffffff808fdecb in syscallenter (td=0xfffff802abd53580) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:132
#28 0xffffffff808fda87 in amd64_syscall (td=0xfffff802abd53580, traced=0) at /usr/src/sys/amd64/amd64/trap.c:915

(kgdb) p rrl->rr_writer->td_tid
$2 = 100351

(kgdb) tid 100351
(kgdb) bt
#0  sched_switch (td=0xfffff80025288000, newtd=0xfffff806ff236580, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2005
#1  0xffffffff806a5091 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:438
#2  0xffffffff806e252c in sleepq_switch (wchan=0xfffffe000ec032d8, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:611
#3  0xffffffff806e23d2 in sleepq_wait (wchan=0xfffffe000ec032d8, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:690
#4  0xffffffff8063ffa3 in _cv_wait (cvp=0xfffffe000ec032d8, lock=<optimized out>) at /usr/src/sys/kern/kern_condvar.c:144
#5  0xffffffff8035bde5 in rrw_enter_write (rrl=0xfffffe000ec032b8) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/rrwlock.c:253
#6  0xffffffff8035c329 in rrm_enter_write (rrl=0xfffffe000ec020e8) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/rrwlock.c:438
#7  0xffffffff8035c2b8 in rrm_enter (rrl=<unavailable>, rw=<unavailable>, tag=<unavailable>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/rrwlock.c:407
#8  0xffffffff803b364b in zfs_mount (vfsp=0xfffff80043895000) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:1715
#9  0xffffffff80747c99 in vfs_domount_update (td=<optimized out>, vp=0xfffff800438b3588, fsflags=268505368, optlist=0xfffffe090e404a80) at /usr/src/sys/kern/vfs_mount.c:976
#10 0xffffffff807454fb in vfs_domount (td=0xfffff80025288000, fstype=<optimized out>, fspath=0xfffff804fe9e41a0 "/usr/obj", fsflags=<optimized out>, optlist=0xfffffe090e404a80) at /usr/src/sys/kern/vfs_mount.c:1132
#11 0xffffffff80744c52 in vfs_donmount (td=0xfffff80025288000, fsflags=<optimized out>, fsoptions=0xfffff8059de54900) at /usr/src/sys/kern/vfs_mount.c:687
#12 0xffffffff80744608 in sys_nmount (td=0xfffff80025288000, uap=<optimized out>) at /usr/src/sys/kern/vfs_mount.c:421
#13 0xffffffff808fdecb in syscallenter (td=0xfffff80025288000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:132
#14 0xffffffff808fda87 in amd64_syscall (td=0xfffff80025288000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:915

(kgdb) tid 102044
(kgdb) bt
#0  sched_switch (td=0xfffff806f0cb0580, newtd=0xfffff80008d17000, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2005
#1  0xffffffff806a5091 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:438
#2  0xffffffff806e252c in sleepq_switch (wchan=0xfffff8065044cb78, pri=96) at /usr/src/sys/kern/subr_sleepqueue.c:611
#3  0xffffffff806e23d2 in sleepq_wait (wchan=0xfffff8065044cb78, pri=96) at /usr/src/sys/kern/subr_sleepqueue.c:690
#4  0xffffffff8067311f in sleeplk (lk=0xfffff8065044cb78, flags=2121728, ilk=<optimized out>, wmesg=0xffffffff80aae927 "zfs", pri=96, timo=51, queue=<optimized out>) at /usr/src/sys/kern/kern_lock.c:288
#5  0xffffffff80671c05 in __lockmgr_args (lk=0xfffff8065044cb78, flags=2121728, ilk=0xfffff8065044cba8, wmesg=<optimized out>, pri=<optimized out>, timo=<optimized out>, file=<optimized out>, line=<optimized out>) at /usr/src/sys/kern/kern_lock.c:873
#6  0xffffffff806711c5 in lockmgr_lock_fast_path (lk=0xfffff8065044cb78, flags=2121728, ilk=0xfffff8065044cba8, file=0xffffffff80acfa01 "/usr/src/sys/kern/vfs_subr.c", line=2606) at /usr/src/sys/kern/kern_lock.c:605
#7  0xffffffff8073d939 in vop_stdlock (ap=<optimized out>) at /usr/src/sys/kern/vfs_default.c:517
#8  0xffffffff803b81e8 in zfs_freebsd_lock (ap=0xfffffe090e8a9588) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:5987
#9  0xffffffff8095608d in VOP_LOCK1_APV (vop=<optimized out>, a=0xfffffe090e8a9588) at vnode_if.c:2087
#10 0xffffffff8075f8ab in VOP_LOCK1 (vp=<optimized out>, flags=<optimized out>, file=<unavailable>, line=<unavailable>) at ./vnode_if.h:859
#11 0xffffffff8075e755 in _vn_lock (vp=<optimized out>, flags=2121728, file=0xffffffff80acfa01 "/usr/src/sys/kern/vfs_subr.c", line=2606) at /usr/src/sys/kern/vfs_vnops.c:1555
#12 0xffffffff8074e5af in vget (vp=0xfffff8065044cb10, flags=<optimized out>, td=0xfffff806f0cb0580) at /usr/src/sys/kern/vfs_subr.c:2606
#13 0xffffffff8073719c in cache_lookup (dvp=0xfffff806cd2063b0, vpp=0xfffffe090e8a9a18, cnp=0xfffffe090e8a9a40, tsp=0x0, ticksp=<optimized out>) at /usr/src/sys/kern/vfs_cache.c:1321
#14 0xffffffff8073a0a4 in vfs_cache_lookup (ap=<optimized out>) at /usr/src/sys/kern/vfs_cache.c:2067
#15 0xffffffff803b8391 in zfs_cache_lookup (ap=<unavailable>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:4956
#16 0xffffffff8095275a in VOP_LOOKUP_APV (vop=<optimized out>, a=0xfffffe090e8a9720) at vnode_if.c:127
#17 0xffffffff80743779 in VOP_LOOKUP (dvp=<optimized out>, vpp=<optimized out>, cnp=<unavailable>) at ./vnode_if.h:54
#18 0xffffffff80742bce in lookup (ndp=0xfffffe090e8a99b8) at /usr/src/sys/kern/vfs_lookup.c:891
#19 0xffffffff807424a7 in namei (ndp=0xfffffe090e8a99b8) at /usr/src/sys/kern/vfs_lookup.c:453
#20 0xffffffff8075e046 in vn_open_cred (ndp=0xfffffe090e8a99b8, flagp=0xfffffe090e8a9ac4, cmode=0, vn_open_flags=0, cred=0xfffff800234fe000, fp=0xfffff8072ca41b90) at /usr/src/sys/kern/vfs_vnops.c:277
#21 0xffffffff8075de4f in vn_open (ndp=<unavailable>, flagp=<unavailable>, cmode=<unavailable>, fp=<unavailable>) at /usr/src/sys/kern/vfs_vnops.c:180
#22 0xffffffff80757b2f in kern_openat (td=0xfffff806f0cb0580, fd=-100, path=0x800de1a80 <error: Cannot access memory at address 0x800de1a80>, pathseg=UIO_USERSPACE, flags=1, mode=<optimized out>) at /usr/src/sys/kern/vfs_syscalls.c:1082
#23 0xffffffff80757d2b in sys_openat (td=0xfffff806f0cb0580, uap=0xfffff806f0cb0930) at /usr/src/sys/kern/vfs_syscalls.c:1030
#24 0xffffffff808fdecb in syscallenter (td=0xfffff806f0cb0580) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:132
#25 0xffffffff808fda87 in amd64_syscall (td=0xfffff806f0cb0580, traced=0) at /usr/src/sys/amd64/amd64/trap.c:915
```

Summary:

1. Thread 102044 is doing a lookup; dvp must be already locked and thus the filesystem is read-locked. The thread is blocked on the child vp lock.
2. Thread 100351 is doing mount -u and wants to write-lock the filesystem; it is blocked by thread 102044.
3. Thread 102522 does vrele on the child vp and needs to lock it. The thread has already got the vnode lock and is blocked by thread 100351.
(In reply to Andriy Gapon from comment #19) Yeah, I was also wondering about changing the lock order. I think that would fix the deadlock, but this is getting kind of hairy. Maybe we could busy the mountpoint while sleeping on the teardown lock, but I'm not sure.

Taking a step back: zfs_rezget() is triggering the deadlock by busying page cache pages, which it does so that it can purge cached data which would otherwise become stale when the dataset is resumed. But it really only needs to purge valid pages: invalid pages won't be mapped, and ZFS marks file pages as valid while holding the teardown lock. The deadlock happens when zfs_rezget() is purging _invalid_ pages that getpages is supposed to fill. So perhaps zfs_rezget() can simply ignore invalid pages.

I tried implementing this and it fixes the deadlock in my stress test, which simply runs buildkernel in a loop while simultaneously rolling back the dataset in a loop. This test also uncovered some UAFs, btw:

It's maybe a bit too hacky, since it means that we check the valid state of a page without busying it, with only the VM object lock held. This is OK for now, at least: to mark a page valid, the page must be busied, but the object lock _and_ busy lock are needed to mark a page invalid. So if vm_object_page_remove() encounters an invalid page, there is no guarantee that it won't later become valid. For ZFS I think it's safe to assume that vnode pages only transition invalid->valid under the teardown lock, but that seems like a delicate assumption...

The only other solution I can see is to add a new VOP to lock a vnode in preparation for a getpages call. This VOP could acquire the teardown lock, so we get a consistent lock order vnode->teardown->busy, and then we don't need to deal with recursion. It's not just the page fault handler which needs it, though; at least sendfile and the image activator need it as well.
See https://reviews.freebsd.org/D32931 for an attempt at a fix.
A commit in branch main references this bug:

```
URL: https://cgit.FreeBSD.org/src/commit/?id=d28af1abf031ee87a478b37180e3f0c518caedf6

commit d28af1abf031ee87a478b37180e3f0c518caedf6
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-11-15 16:44:04 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-11-15 18:01:30 +0000

    vm: Add a mode to vm_object_page_remove() which skips invalid pages

    This will be used to break a deadlock in ZFS between the per-mountpoint
    teardown lock and page busy locks. In particular, when purging data
    from the page cache during dataset rollback, we want to avoid blocking
    on the busy state of invalid pages since the busying thread may be
    blocked on the teardown lock in zfs_getpages().

    Add a helper, vn_pages_remove_valid(), for use by filesystems. Bump
    __FreeBSD_version so that the OpenZFS port can make use of the new
    helper.

    PR:             258208
    Reviewed by:    avg, kib, sef
    Tested by:      pho (part of a larger patch)
    MFC after:      2 weeks
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D32931

 sys/kern/vfs_vnops.c | 22 ++++++++++++++++++++++
 sys/sys/vnode.h      |  2 ++
 sys/vm/vm_object.c   | 19 +++++++++++++++++++
 sys/vm/vm_object.h   |  1 +
 4 files changed, 44 insertions(+)
```
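Going by the commit message alone, the new helper presumably wraps vm_object_page_remove() with a flag requesting the new skip-invalid mode. The following is a pseudocode-level sketch in C syntax, not the actual diff: the flag name `OBJPR_VALIDONLY` and the locking details are assumptions.

```c
/*
 * Sketch only, inferred from the commit message above.  OBJPR_VALIDONLY
 * is an assumed name for the new vm_object_page_remove() mode that skips
 * (and therefore never busies) invalid pages.
 */
void
vn_pages_remove_valid(struct vnode *vp, vm_pindex_t start, vm_pindex_t end)
{
	vm_object_t object;

	if ((object = vp->v_object) == NULL)
		return;
	VM_OBJECT_WLOCK(object);
	vm_object_page_remove(object, start, end, OBJPR_VALIDONLY);
	VM_OBJECT_WUNLOCK(object);
}
```

Because invalid pages are skipped without being busied, the purge never waits on a page the fault handler owns.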
A commit in branch main references this bug:

```
URL: https://cgit.FreeBSD.org/src/commit/?id=705a6ee2b6112c3a653b2bd68f961a8b5b8071a4

commit 705a6ee2b6112c3a653b2bd68f961a8b5b8071a4
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-11-20 16:21:25 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-11-20 16:21:25 +0000

    zfs: Fix a deadlock between page busy and the teardown lock

    When rolling back a dataset, ZFS has to purge file data resident in the
    system page cache. To do this, it loops over all vnodes for the
    mountpoint and calls vn_pages_remove() to purge pages associated with
    the vnode's VM object. Each page is thus exclusively busied while the
    dataset's teardown write lock is held.

    When handling a page fault on a mapped ZFS file, FreeBSD's page fault
    handler busies newly allocated pages and then uses VOP_GETPAGES to fill
    them. The ZFS getpages VOP acquires the teardown read lock with vnode
    pages already busied. This represents a lock order reversal which can
    lead to deadlock.

    To break the deadlock, observe that zfs_rezget() need only purge those
    pages marked valid, and that pages busied by the page fault handler
    are, by definition, invalid. Furthermore, ZFS pages always transition
    from invalid to valid with the teardown lock held, and ZFS never
    creates partially valid pages. Thus, zfs_rezget() can use the new
    vn_pages_remove_valid() to skip over pages busied by the fault handler.

    PR:             258208
    Tested by:      pho
    Reviewed by:    avg, sef, kib
    MFC after:      2 weeks
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D32931

 sys/contrib/openzfs/module/os/freebsd/zfs/zfs_znode.c | 9 +++++++++
 1 file changed, 9 insertions(+)
```
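The nine-line zfs_znode.c change presumably switches zfs_rezget()'s purge from vn_pages_remove() to the new helper behind a __FreeBSD_version guard (the companion commit bumped the version for exactly this purpose). A pseudocode-level sketch in C syntax; the guard value and exact placement are assumptions, not copied from the diff.

```c
/*
 * Sketch of the zfs_rezget() change, inferred from the commit message.
 * Pages busied by the fault handler are invalid by definition, so the
 * valid-only purge skips them and the lock order reversal is avoided.
 */
#if __FreeBSD_version >= 1400042	/* assumed guard value */
	vn_pages_remove_valid(vp, 0, 0);
#else
	vn_pages_remove(vp, 0, 0);
#endif
```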
A commit in branch stable/13 references this bug:

```
URL: https://cgit.FreeBSD.org/src/commit/?id=fdd27db34802decf062339411e5f84993e733be0

commit fdd27db34802decf062339411e5f84993e733be0
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-11-15 16:44:04 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-11-29 14:09:28 +0000

    vm: Add a mode to vm_object_page_remove() which skips invalid pages

    This will be used to break a deadlock in ZFS between the per-mountpoint
    teardown lock and page busy locks. In particular, when purging data
    from the page cache during dataset rollback, we want to avoid blocking
    on the busy state of invalid pages since the busying thread may be
    blocked on the teardown lock in zfs_getpages().

    Add a helper, vn_pages_remove_valid(), for use by filesystems. Bump
    __FreeBSD_version so that the OpenZFS port can make use of the new
    helper.

    PR:             258208
    Reviewed by:    avg, kib, sef
    Tested by:      pho (part of a larger patch)
    Sponsored by:   The FreeBSD Foundation

    (cherry picked from commit d28af1abf031ee87a478b37180e3f0c518caedf6)

 sys/kern/vfs_vnops.c | 22 ++++++++++++++++++++++
 sys/sys/vnode.h      |  2 ++
 sys/vm/vm_object.c   | 19 +++++++++++++++++++
 sys/vm/vm_object.h   |  1 +
 4 files changed, 44 insertions(+)
```
A commit in branch stable/13 references this bug:

```
URL: https://cgit.FreeBSD.org/src/commit/?id=0c5f290ed90963b64b05a2098e0b11c3d9f6471f

commit 0c5f290ed90963b64b05a2098e0b11c3d9f6471f
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-11-20 16:21:25 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-12-05 18:38:35 +0000

    zfs: Fix a deadlock between page busy and the teardown lock

    When rolling back a dataset, ZFS has to purge file data resident in the
    system page cache. To do this, it loops over all vnodes for the
    mountpoint and calls vn_pages_remove() to purge pages associated with
    the vnode's VM object. Each page is thus exclusively busied while the
    dataset's teardown write lock is held.

    When handling a page fault on a mapped ZFS file, FreeBSD's page fault
    handler busies newly allocated pages and then uses VOP_GETPAGES to fill
    them. The ZFS getpages VOP acquires the teardown read lock with vnode
    pages already busied. This represents a lock order reversal which can
    lead to deadlock.

    To break the deadlock, observe that zfs_rezget() need only purge those
    pages marked valid, and that pages busied by the page fault handler
    are, by definition, invalid. Furthermore, ZFS pages always transition
    from invalid to valid with the teardown lock held, and ZFS never
    creates partially valid pages. Thus, zfs_rezget() can use the new
    vn_pages_remove_valid() to skip over pages busied by the fault handler.

    PR:             258208
    Tested by:      pho
    Reviewed by:    avg, sef, kib
    Sponsored by:   The FreeBSD Foundation

    (cherry picked from commit 705a6ee2b6112c3a653b2bd68f961a8b5b8071a4)

 sys/contrib/openzfs/module/os/freebsd/zfs/zfs_znode.c | 9 +++++++++
 1 file changed, 9 insertions(+)
```
I believe the original problem is now fixed on main and stable/13. There is a stress2 scenario which triggers the deadlock (zfs14.sh), so we now have a nice regression test too.
A commit in branch vendor/openzfs/master references this bug:

```
URL: https://cgit.FreeBSD.org/src/commit/?id=cdf74673bc7993a02264210b796a77193357966c

commit cdf74673bc7993a02264210b796a77193357966c
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-11-20 16:21:25 +0000
Commit:     Brian Behlendorf <behlendorf1@llnl.gov>
CommitDate: 2021-12-12 19:13:18 +0000

    zfs: Fix a deadlock between page busy and the teardown lock

    When rolling back a dataset, ZFS has to purge file data resident in the
    system page cache. To do this, it loops over all vnodes for the
    mountpoint and calls vn_pages_remove() to purge pages associated with
    the vnode's VM object. Each page is thus exclusively busied while the
    dataset's teardown write lock is held.

    When handling a page fault on a mapped ZFS file, FreeBSD's page fault
    handler busies newly allocated pages and then uses VOP_GETPAGES to fill
    them. The ZFS getpages VOP acquires the teardown read lock with vnode
    pages already busied. This represents a lock order reversal which can
    lead to deadlock.

    To break the deadlock, observe that zfs_rezget() need only purge those
    pages marked valid, and that pages busied by the page fault handler
    are, by definition, invalid. Furthermore, ZFS pages always transition
    from invalid to valid with the teardown lock held, and ZFS never
    creates partially valid pages. Thus, zfs_rezget() can use the new
    vn_pages_remove_valid() to skip over pages busied by the fault handler.

    PR:             258208
    Tested by:      pho
    Reviewed by:    avg, sef, kib
    MFC after:      2 weeks
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D32931

    Reviewed-by: Tony Hutter <hutter2@llnl.gov>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Alexander Motin <mav@FreeBSD.org>
    Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
    Closes #12828

 module/os/freebsd/zfs/zfs_znode.c | 9 +++++++++
 1 file changed, 9 insertions(+)
```
A commit in branch vendor/openzfs/zfs-2.1-release references this bug:

```
URL: https://cgit.FreeBSD.org/src/commit/?id=b3427b18b1c694f9dbc4673a11a6cf57c7f5c5d8

commit b3427b18b1c694f9dbc4673a11a6cf57c7f5c5d8
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-11-20 16:21:25 +0000
Commit:     Tony Hutter <hutter2@llnl.gov>
CommitDate: 2022-03-04 23:37:41 +0000

    zfs: Fix a deadlock between page busy and the teardown lock

    When rolling back a dataset, ZFS has to purge file data resident in the
    system page cache. To do this, it loops over all vnodes for the
    mountpoint and calls vn_pages_remove() to purge pages associated with
    the vnode's VM object. Each page is thus exclusively busied while the
    dataset's teardown write lock is held.

    When handling a page fault on a mapped ZFS file, FreeBSD's page fault
    handler busies newly allocated pages and then uses VOP_GETPAGES to fill
    them. The ZFS getpages VOP acquires the teardown read lock with vnode
    pages already busied. This represents a lock order reversal which can
    lead to deadlock.

    To break the deadlock, observe that zfs_rezget() need only purge those
    pages marked valid, and that pages busied by the page fault handler
    are, by definition, invalid. Furthermore, ZFS pages always transition
    from invalid to valid with the teardown lock held, and ZFS never
    creates partially valid pages. Thus, zfs_rezget() can use the new
    vn_pages_remove_valid() to skip over pages busied by the fault handler.

    PR:             258208
    Tested by:      pho
    Reviewed by:    avg, sef, kib
    MFC after:      2 weeks
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D32931

    Reviewed-by: Tony Hutter <hutter2@llnl.gov>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Alexander Motin <mav@FreeBSD.org>
    Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
    Closes #12828

 module/os/freebsd/zfs/zfs_znode.c | 9 +++++++++
 1 file changed, 9 insertions(+)
```
Please: how much of this – pre- and post-fix – might land in releng/13.1? Is releng/13.2 more likely? I do see the two-week MFC for the post-fix commit, so I'm not asking anyone to jump the gun; just curious (with draft release notes for 13.1 in mind). Thanks.
(In reply to Graham Perrin from comment #29) What do you mean by pre- and post-fixes? The commit which fixes the original bug has been in stable/13 for several months and so will be included in releng/13.1.