Bug 258208 - [zfs] locks up when using rollback or destroy on both 13.0-RELEASE & sysutils/openzfs port
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 13.0-RELEASE
Hardware: Any
OS: Any
Importance: --- Affects Only Me
Assignee: freebsd-fs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-09-02 10:10 UTC by Dave Cottlehuber
Modified: 2021-10-19 01:00 UTC
CC List: 4 users

See Also:


Attachments
procstat-kk-a.log (565.36 KB, text/plain)
2021-09-17 10:18 UTC, Dave Cottlehuber
old ZFS VOP_LOCK work (11.63 KB, patch)
2021-10-15 09:30 UTC, Andriy Gapon
old ZFS VOP_LOCK work (10.85 KB, patch)
2021-10-15 09:39 UTC, Andriy Gapon
first attempt (5.38 KB, patch)
2021-10-18 21:24 UTC, Mark Johnston
second attempt (5.64 KB, patch)
2021-10-19 01:00 UTC, Mark Johnston

Description Dave Cottlehuber freebsd_committer 2021-09-02 10:10:42 UTC
ZFS operations such as rollback or destroy deadlock FreeBSD. Subsequent commands
such as `mount -p` or `bectl list` also hang. Writing data to files still works.
dmesg, syslog, etc. are all empty.

## environment

tried under "default" 13.0-RELEASE-p3 zfs, and also under "kmod"

zfs-2.1.99-1
zfs-kmod-v2021073000-zfs_7eebcd2be

I will build CURRENT with debug, re-try this, and report back.

## pools

embiggen/koans dataset (zpool of 4 drive striped mirrors)
envy (nvme zpool)

  pool: embiggen
 state: ONLINE
  scan: scrub repaired 0B in 02:42:41 with 0 errors on Thu Aug 26 18:22:44 2021
config:

	NAME          STATE     READ WRITE CKSUM
	embiggen      ONLINE       0     0     0
	  mirror-0    ONLINE       0     0     0
	    gpt/zfs0  ONLINE       0     0     0
	    gpt/zfs1  ONLINE       0     0     0
	  mirror-1    ONLINE       0     0     0
	    gpt/zfs2  ONLINE       0     0     0
	    gpt/zfs3  ONLINE       0     0     0

errors: No known data errors

  pool: envy
 state: ONLINE
  scan: scrub repaired 0B in 00:07:20 with 0 errors on Thu Aug 26 15:47:42 2021
config:

	NAME        STATE     READ WRITE CKSUM
	envy        ONLINE       0     0     0
	  gpt/envy  ONLINE       0     0     0

errors: No known data errors


## default 13.0-RELEASE zfs

- the parent process is not a zombie
- it is not writing a core dump
- it can't be attached to from lldb, etc.
- it won't respond to `kill -CONT ...`

```
$ top -SjwbHzPp 47255
last pid: 20225;  load averages:  0.78,  0.81,  0.70  up 2+02:11:14    14:14:38
3744 threads:  10 running, 3693 sleeping, 6 stopped, 35 waiting
CPU 0:  2.6% user,  0.8% nice,  1.0% system,  0.0% interrupt, 95.6% idle
CPU 1:  3.0% user,  0.9% nice,  1.1% system,  0.6% interrupt, 94.3% idle
CPU 2:  3.2% user,  1.0% nice,  1.2% system,  0.0% interrupt, 94.6% idle
CPU 3:  3.3% user,  1.0% nice,  1.2% system,  0.0% interrupt, 94.6% idle
CPU 4:  3.4% user,  1.0% nice,  1.2% system,  0.1% interrupt, 94.4% idle
CPU 5:  3.5% user,  1.0% nice,  1.2% system,  0.1% interrupt, 94.3% idle
CPU 6:  3.5% user,  1.0% nice,  1.3% system,  1.1% interrupt, 93.0% idle
CPU 7:  3.4% user,  1.0% nice,  1.2% system,  0.0% interrupt, 94.3% idle
Mem: 2489M Active, 20G Inact, 472M Laundry, 66G Wired, 8074K Buf, 35G Free
ARC: 51G Total, 34G MFU, 9441M MRU, 15M Anon, 518M Header, 7401M Other
     36G Compressed, 69G Uncompressed, 1.93:1 Ratio
Swap: 252G Total, 252G Free

  PID   JID USERNAME    PRI NICE   SIZE    RES SWAP STATE    C   TIME    WCPU COMMAND
47255     0 dch          20    0  1316M   111M   0B STOP     4   0:00   0.00% beam.smp{10_dirty_io_sch}
47255     0 dch          20    0  1316M   111M   0B STOP     4   0:00   0.00% beam.smp{sys_sig_dispatc}

$ ps augx -p 47255
USER   PID %CPU %MEM     VSZ    RSS TT  STAT STARTED    TIME COMMAND
dch  47255  0.0  0.1 1347516 113316  1  TX   13:57   0:03.48 /usr/local/lib/erlang24/erts-12.0.3/bin/beam.smp -- -root /usr/local/lib/erlang24 -progname erl

$ sudo procstat -kk 47255
  PID    TID COMM                TDNAME              KSTACK                       
47255 574625 beam.smp            sys_sig_dispatc     mi_switch+0xc1 thread_suspend_switch+0xc0 thread_single+0x69c exit1+0xc1 sys_sys_exit+0xd amd64_syscall+0x10c fast_syscall_common+0xf8 
47255 574653 beam.smp            10_dirty_io_sch     mi_switch+0xc1 _sleep+0x1cb rms_rlock_fallback+0x90 zfs_lookup+0x7e zfs_freebsd_cachedlookup+0x6b vfs_cache_lookup+0xad lookup+0x68c namei+0x487 kern_statat+0xcf sys_fstatat+0x2f amd64_syscall+0x10c fast_syscall_common+0xf8 
```

This situation recurs after a reboot, showing a hanging zfs rollback or similar command each time:


```
/sbin/zfs rollback -r embiggen/...@pristine
18990 168480 zfs                 -                   mi_switch+0xc1 _vm_page_busy_sleep+0x100 vm_page_sleep_if_busy+0x28 vm_object_page_remove+0xdf vn_pages_remove+0x4c zfs_rezget+0x35 zfs_resume_fs+0x258 zfs_ioc_rollback+0x158 zfsdev_ioctl_common+0x4e3 zfsdev_ioctl+0x143 devfs_ioctl+0xc7 vn_ioctl+0x1a4 devfs_ioctl_f+0x1e kern_ioctl+0x26d sys_ioctl+0xf6 amd64_syscall+0x10c fast_syscall_common+0xf8 
```
Comment 1 Dave Cottlehuber freebsd_committer 2021-09-02 10:15:13 UTC
## zfs-kmod-v2021073000-zfs_7eebcd2be

```
sudo zfs destroy -vrRf 'embiggen/koans'
will destroy embiggen/koans/cache/koan-ci/6253a9f1f8df5d292b0c675fafde7351c89d4e76587b8de7d7b76e699acdccf6
load: 0.29  cmd: sudo 95533 [select] 1104.21r 0.00u 0.01s 0% 8120k
mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _cv_wait_sig+0xe4 seltdwait+0xa9 kern_poll+0x51b sys_ppoll+0x6f amd64_syscall+0x10c fast_syscall_common+0xf8
^C
load: 0.44  cmd: sudo 95533 [select] 2002.06r 0.00u 0.01s 0% 8120k
mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _cv_wait_sig+0xe4 seltdwait+0xa9 kern_poll+0x51b sys_ppoll+0x6f amd64_syscall+0x10c fast_syscall_common+0xf8
load: 0.48  cmd: sudo 95533 [select] 3931.90r 0.00u 0.01s 0% 8120k
mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _cv_wait_sig+0xe4 seltdwait+0xa9 kern_poll+0x51b sys_ppoll+0x6f amd64_syscall+0x10c fast_syscall_common+0xf8
```

## subsequent commands hang

```
# bectl list
^C
load: 0.46  cmd: bectl 97606 [zfs teardown] 1221.12r 0.00u 0.01s 0% 8056k
mi_switch+0xc1 _sleep+0x1cb rms_rlock_fallback+0x90 zfs_statfs+0x32 kern_getfsstat+0x202 sys_getfsstat+0x22 amd64_syscall+0x10c fast_syscall_common+0xf8
```

## zpool events -vf

From last "sync event" prior to the hanging destroy

```

Sep  2 2021 08:39:43.644450230 sysevent.fs.zfs.config_sync
        version = 0x0
        class = "sysevent.fs.zfs.config_sync"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        time = 0x61308dcf 0x266987b6 
        eid = 0x9

Sep  2 2021 08:42:38.680064715 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/0e6074eb51ffe7c5d478cb154327221d0355248e/%rollback"
        history_internal_str = "parent=0e6074eb51ffe7c5d478cb154327221d0355248e"
        history_internal_name = "clone swap"
        history_dsid = 0x1ea
        history_txg = 0x1019819
        history_time = 0x61308e7e
        time = 0x61308e7e 0x2888f6cb 
        eid = 0xa

Sep  2 2021 08:42:38.698068059 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/0e6074eb51ffe7c5d478cb154327221d0355248e/%rollback"
        history_internal_str = "(bptree, mintxg=16881555)"
        history_internal_name = "destroy"
        history_dsid = 0x1ea
        history_txg = 0x1019819
        history_time = 0x61308e7e
        time = 0x61308e7e 0x299bac5b 
        eid = 0xb

Sep  2 2021 08:42:39.009063859 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/0e6074eb51ffe7c5d478cb154327221d0355248e/%rollback"
        history_internal_str = "parent=0e6074eb51ffe7c5d478cb154327221d0355248e"
        history_internal_name = "clone swap"
        history_dsid = 0x1f8
        history_txg = 0x101981e
        history_time = 0x61308e7f
        time = 0x61308e7f 0x8a4db3 
        eid = 0xc

Sep  2 2021 08:42:39.011064352 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/0e6074eb51ffe7c5d478cb154327221d0355248e/%rollback"
        history_internal_str = "(bptree, mintxg=16881555)"
        history_internal_name = "destroy"
        history_dsid = 0x1f8
        history_txg = 0x101981e
        history_time = 0x61308e7f
        time = 0x61308e7f 0xa8d420 
        eid = 0xd

Sep  2 2021 08:42:41.355063329 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/cache/h2o/338ff637ff357ba38882e44d01251c511523ab67f2be5827b25d9de423544eb3@c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8"
        history_internal_str = " "
        history_internal_name = "snapshot"
        history_dsid = 0x3b4
        history_txg = 0x101981f
        history_time = 0x61308e81
        time = 0x61308e81 0x1529d621 
        eid = 0xe

Sep  2 2021 08:42:41.496063945 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8"
        history_internal_str = "origin=embiggen/koans/templates/12.2-RELEASE@pristine (35650)"
        history_internal_name = "clone"
        history_dsid = 0x1ea
        history_txg = 0x1019820
        history_time = 0x61308e81
        time = 0x61308e81 0x1d9155c9 
        eid = 0xf

Sep  2 2021 08:42:41.648063789 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8@pristine"
        history_internal_str = " "
        history_internal_name = "snapshot"
        history_dsid = 0x328
        history_txg = 0x1019821
        history_time = 0x61308e81
        time = 0x61308e81 0x26a0ab2d 
        eid = 0x10

Sep  2 2021 08:42:41.923063860 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "parent=c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8"
        history_internal_name = "clone swap"
        history_dsid = 0x27d
        history_txg = 0x1019826
        history_time = 0x61308e81
        time = 0x61308e81 0x3704d634 
        eid = 0x11

Sep  2 2021 08:42:41.926063583 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "(bptree, mintxg=16881697)"
        history_internal_name = "destroy"
        history_dsid = 0x27d
        history_txg = 0x1019826
        history_time = 0x61308e81
        time = 0x61308e81 0x37329bdf 
        eid = 0x12

Sep  2 2021 08:42:44.748063744 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "parent=c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8"
        history_internal_name = "clone swap"
        history_dsid = 0x3bb
        history_txg = 0x101982b
        history_time = 0x61308e84
        time = 0x61308e84 0x2c968c00 
        eid = 0x13

Sep  2 2021 08:42:44.751063792 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "(bptree, mintxg=16881697)"
        history_internal_name = "destroy"
        history_dsid = 0x3bb
        history_txg = 0x101982b
        history_time = 0x61308e84
        time = 0x61308e84 0x2cc452f0 
        eid = 0x14

Sep  2 2021 08:42:44.968063430 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "parent=c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8"
        history_internal_name = "clone swap"
        history_dsid = 0x71
        history_txg = 0x1019830
        history_time = 0x61308e84
        time = 0x61308e84 0x39b379c6 
        eid = 0x15

Sep  2 2021 08:42:44.970063949 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "(bptree, mintxg=16881697)"
        history_internal_name = "destroy"
        history_dsid = 0x71
        history_txg = 0x1019830
        history_time = 0x61308e84
        time = 0x61308e84 0x39d2004d 
        eid = 0x16
```

and then nothing for 30 minutes.
Comment 2 Mark Millard 2021-09-02 21:43:26 UTC
In a simpler context I do not see the problem for destroy (so: compare/contrast with your context).

Context:

# zpool status
  pool: zopt0
 state: ONLINE
  scan: scrub repaired 0B in 00:01:35 with 0 errors on Sat Jun 19 18:23:07 2021
config:

	NAME        STATE     READ WRITE CKSUM
	zopt0       ONLINE       0     0     0
	  nda0p3    ONLINE       0     0     0

errors: No known data errors

# uname -apKU
FreeBSD CA72_16Gp_ZFS 13.0-RELEASE-p4 FreeBSD 13.0-RELEASE-p4 #4 releng/13.0-n244760-940681634ee1-dirty: Mon Aug 30 11:35:45 PDT 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/13_0R-CA72-nodbg-clang/usr/13_0R-src/arm64.aarch64/sys/GENERIC-NODBG-CA72  arm64 aarch64 1300139 1300139

Destroy related activity (of prior cross-build materials):

# zfs list | grep amd64
zopt0/BUILDs/13S-amd64-nodbg-clang    7.72G   167G     7.72G  /usr/obj/BUILDs/13S-amd64-nodbg-clang
zopt0/BUILDs/13_0R-amd64-nodbg-clang  7.81G   167G     7.81G  /usr/obj/BUILDs/13_0R-amd64-nodbg-clang
zopt0/BUILDs/main-amd64-nodbg-clang   7.98G   167G     7.98G  /usr/obj/BUILDs/main-amd64-nodbg-clang
zopt0/DESTDIRs/13_0R-amd64-poud         96K   167G       96K  /usr/obj/DESTDIRs/13_0R-amd64-poud
zopt0/DESTDIRs/main-amd64-poud          96K   167G       96K  /usr/obj/DESTDIRs/main-amd64-poud

# zfs destroy -vrRf zopt0/DESTDIRs/main-amd64-poud 
will destroy zopt0/DESTDIRs/main-amd64-poud

# zfs destroy -vrRf zopt0/DESTDIRs/13_0R-amd64-poud 
will destroy zopt0/DESTDIRs/13_0R-amd64-poud

# zfs destroy -vrRf zopt0/BUILDs/main-amd64-nodbg-clang
will destroy zopt0/BUILDs/main-amd64-nodbg-clang

# zfs destroy -vrRf zopt0/BUILDs/13_0R-amd64-nodbg-clang
will destroy zopt0/BUILDs/13_0R-amd64-nodbg-clang

# zfs destroy -vrRf zopt0/BUILDs/13S-amd64-nodbg-clang
will destroy zopt0/BUILDs/13S-amd64-nodbg-clang

# zfs list | grep amd64
#

(I added the blank lines to make reading easier.)
Comment 3 Andriy Gapon freebsd_committer 2021-09-03 05:04:48 UTC
Please attach the full procstat -kk -a output.
Comment 4 Dave Cottlehuber freebsd_committer 2021-09-17 10:12:41 UTC
Further investigation suggests this is related to devfs mounts as well, i.e. it may not be directly ZFS, or ZFS at all.

I can trigger it by mounting ~90+ devfs instances on cloned ZFS filesystems, and then trying just one more clone: boom.

I'm almost ready with a simplified test case.
Comment 5 Dave Cottlehuber freebsd_committer 2021-09-17 10:18:36 UTC
Created attachment 227959 [details]
procstat-kk-a.log
Comment 6 Mark Johnston freebsd_committer 2021-09-17 13:21:52 UTC
(In reply to Dave Cottlehuber from comment #5)
So there is effectively a lock-order reversal (LOR) here:

   56 152682 zfs                 -                   mi_switch+0xc1 _vm_page_busy_sleep+0x100 vm_page_sleep_if_busy+0x28 vm_object_page_remove+0xdf vn_pages_remove+0x4c zfs_rezget+0x35 zfs_resume_fs+0x258 zfs_ioc_rollback+0x158 zfsdev_ioctl_common+0x4e3 zfsdev_ioctl+0x143 devfs_ioctl+0xc7 vn_ioctl+0x1a4 devfs_ioctl_f+0x1e kern_ioctl+0x26d sys_ioctl+0xf6 amd64_syscall+0x10c fast_syscall_common+0xf8 

  445 168579 ping                -                   mi_switch+0xc1 _sleep+0x1cb rms_rlock_fallback+0x90 zfs_freebsd_getpages+0x52 vnode_pager_getpages+0x41 vm_pager_get_pages+0x22 vm_fault+0x49a vm_fault_trap+0x6d trap_pfault+0x1f8 trap+0x3fd calltrap+0x8

The page array passed to zfs_getpages() is busied upon entry, and then we do ZFS_ENTER(zfsvfs), i.e. take the teardown lock.  But zfs_rezget() is called with this lock held and then tries to busy vnode pages, so the two paths acquire the locks in opposite orders.
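
For anyone following along, the shape of the problem in isolation is a classic ABBA deadlock.  A userspace analogy, with plain pthread mutexes standing in for the teardown lock and the page-busy state (illustration only, not ZFS code):

```
/*
 * Illustration only: mutexes stand in for z_teardown_lock and the
 * page-busy state; none of this is ZFS or FreeBSD kernel code.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t teardown = PTHREAD_MUTEX_INITIALIZER;  /* ~ z_teardown_lock */
static pthread_mutex_t page_busy = PTHREAD_MUTEX_INITIALIZER; /* ~ page busy state */

/* Rollback-like path: teardown held, then tries to "busy" pages. */
static void *
rollback_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&teardown);   /* ~ zfs_resume_fs holding teardown */
	sleep(1);                        /* widen the race window */
	pthread_mutex_lock(&page_busy);  /* ~ zfs_rezget -> vn_pages_remove */
	puts("rollback done");
	pthread_mutex_unlock(&page_busy);
	pthread_mutex_unlock(&teardown);
	return (NULL);
}

/* Fault-like path: pages busied first, then the teardown lock. */
static void *
getpages_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&page_busy);  /* ~ fault handler busying pages */
	sleep(1);
	pthread_mutex_lock(&teardown);   /* ~ zfs_freebsd_getpages -> ZFS_ENTER */
	puts("getpages done");
	pthread_mutex_unlock(&teardown);
	pthread_mutex_unlock(&page_busy);
	return (NULL);
}

int
main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, rollback_thread, NULL);
	pthread_create(&b, NULL, getpages_thread, NULL);
	pthread_join(a, NULL);  /* never returns: each thread waits on the other */
	pthread_join(b, NULL);
	return (0);
}
```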
Comment 7 Mark Johnston freebsd_committer 2021-09-25 15:19:50 UTC
I am not sure how best to fix this.  To elaborate a bit more, the deadlock occurs because a rollback does a suspend/resume of the target fs.  This involves taking the teardown write lock; one thing we do with the lock held is call zfs_rezget() on all vnodes associated with the filesystem, which among other things throws away all data cached in the page cache.  This requires pages to be busied with the ZFS write lock held, so I am inclined to think that zfs_freebsd_getpages() should be responsible for breaking the deadlock as it does in https://cgit.freebsd.org/src/commit/?id=cd32b4f5b79c97b293f7be3fe9ddfc9024f7d734 .

zfs_freebsd_getpages() could perhaps trylock and upon failure return some EAGAIN-like status to ask the fault handler to retry, but I don't see a way to do that: vm_fault_getpages() squashes the error and does not allow the pager to return KERN_RESOURCE_SHORTAGE.

Alternatively, zfs_freebsd_getpages() could perhaps wire and unbusy the page upon a trylock failure.  Once it successfully acquires the teardown read lock, it could re-look up the fault page and compare, or re-insert the wired page if necessary.
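
Purely as an illustration of that back-off idea (pthread primitives again standing in for the teardown lock and page busying, so none of this is FreeBSD KPI), the pattern looks roughly like this:

```
/*
 * Illustration only: the "busy" holder backs off when it cannot take
 * the teardown lock, so the ABBA cycle from the previous analogy
 * cannot form.  Not FreeBSD KPI.
 */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t teardown = PTHREAD_MUTEX_INITIALIZER;  /* ~ z_teardown_lock */
static pthread_mutex_t page_busy = PTHREAD_MUTEX_INITIALIZER; /* ~ page busy state */

/* Rollback-like path, unchanged: teardown first, then the busy state. */
static void *
rollback_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&teardown);
	sleep(1);
	pthread_mutex_lock(&page_busy);
	puts("rollback done");
	pthread_mutex_unlock(&page_busy);
	pthread_mutex_unlock(&teardown);
	return (NULL);
}

/* getpages-like path: busy first, but drop it if teardown is contended. */
static void *
getpages_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&page_busy);
	while (pthread_mutex_trylock(&teardown) != 0) {
		/*
		 * Teardown is held, possibly by a thread waiting on our
		 * busy state: drop the busy state, yield, re-busy, retry.
		 * In the real proposal the page would be wired first so
		 * it cannot be freed while unbusied.
		 */
		pthread_mutex_unlock(&page_busy);
		sched_yield();
		pthread_mutex_lock(&page_busy);
	}
	puts("getpages done");
	pthread_mutex_unlock(&teardown);
	pthread_mutex_unlock(&page_busy);
	return (NULL);
}

int
main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, rollback_thread, NULL);
	pthread_create(&b, NULL, getpages_thread, NULL);
	pthread_join(a, NULL);  /* both threads finish this time */
	pthread_join(b, NULL);
	return (0);
}
```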

OTOH I cannot see how this is handled on Linux.  In particular, I do not see how their zfs_rezget() invalidates the page cache.
Comment 8 Andriy Gapon freebsd_committer 2021-09-29 06:37:06 UTC
Another potential solution, for which I have (or had) a prototype against the older FreeBSD ZFS, is to have custom lock/unlock vops in ZFS that enter the teardown lock before acquiring the vnode lock.  This establishes a lock order friendly to zfs_rezget.

There were some tricky details to get that approach working, but I think I got it working in the end.

The new order between the teardown and the vnode locks also made it possible to simplify some code in other vops, where currently we have to drop and re-acquire the teardown lock each time we need to get the vnode lock.
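
To illustrate the ordering idea only (hypothetical names, pthread primitives standing in for the real locks; this is not the old patch):

```
/*
 * Illustration only: a compound lock/unlock pair that always takes the
 * "teardown" lock before the "vnode" lock, so every consumer sees the
 * same global order.  Hypothetical names, not the attached patch.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t teardown = PTHREAD_RWLOCK_INITIALIZER; /* ~ z_teardown_lock */
static pthread_mutex_t vnode_lock = PTHREAD_MUTEX_INITIALIZER; /* ~ the vnode lock */

/* ~ custom VOP_LOCK: teardown (read) first, then the vnode lock. */
static void
zfslike_vop_lock(void)
{
	pthread_rwlock_rdlock(&teardown);
	pthread_mutex_lock(&vnode_lock);
}

/* ~ custom VOP_UNLOCK: release in the reverse order. */
static void
zfslike_vop_unlock(void)
{
	pthread_mutex_unlock(&vnode_lock);
	pthread_rwlock_unlock(&teardown);
}

/*
 * ~ rollback/suspend path: because every vop already holds teardown
 * before touching vnodes or their pages, taking it for writing here can
 * never be ordered against a vop the "wrong" way around.
 */
static void
rollback_like(void)
{
	pthread_rwlock_wrlock(&teardown);
	/* ... zfs_rezget-style per-vnode work would happen here ... */
	pthread_rwlock_unlock(&teardown);
}

int
main(void)
{
	zfslike_vop_lock();
	puts("vop running with teardown (read) + vnode lock held");
	zfslike_vop_unlock();
	rollback_like();
	puts("rollback ran with teardown held for writing");
	return (0);
}
```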
Comment 9 Dave Cottlehuber freebsd_committer 2021-10-14 19:43:48 UTC
avg@, can you share that prior work at least? It may be helpful.
Comment 10 Mark Johnston freebsd_committer 2021-10-15 00:13:43 UTC
(In reply to Andriy Gapon from comment #8)
Sorry, I missed this comment.  I can try to help finish such a patch if it would be useful.  I think hacking zfs_getpages() to replace pages when a trylock of the teardown lock fails also works, but your approach has other benefits.
Comment 11 Andriy Gapon freebsd_committer 2021-10-15 09:30:14 UTC
Created attachment 228714 [details]
old ZFS VOP_LOCK work

Here is the best I can share immediately.
This is from an old branch.
I am not sure how soon I may have time to rebase it,
but I hope you can find something useful in it.
Comment 12 Andriy Gapon freebsd_committer 2021-10-15 09:37:12 UTC
And the commit message:
Author:     Andriy Gapon <avg@icyb.net.ua>
AuthorDate: Fri Nov 16 12:11:45 2012 +0200

    zfs: introduce zfs_freebsd_lock and zfs_freebsd_unlock...
    
    These vnode operations acquire and release z_teardown_lock in addition
    to the normal operations on the vnode lock.
    This is a half-step towards fixing lock order problems between VM page
    busying "locks" and ZFS internal locks.  In particular, now z_teardown_lock
    is always acquired prior to page busying (the same as the vnode lock).
    Ideally I wanted z_teardown_lock to be acquired before the vnode lock, but
    this is not possible to implement using VOP_LOCK1 approach.
    
    Additional effect of this change is that now all ZFS vnode ops are guaranteed
    to be protected with z_teardown_lock.  Previously some FreeBSD-specific
    operations accessed znodes and zfsvfs without calling ZFS_ENTER first.
    Consequently, after this change the ZFS_ENTER calls in the vnode ops have
    become redundant.  They are still needed in ZFS vfs ops.
    
    This change should fix a deadlock between vn_pages_remove/vnode_pager_setsize
    called from zfs_rezget with z_teardown_lock held and putpages operation
    blocked on z_teardown_lock while entering zfs_write.
    
    This change has also made z_teardown_inactive_lock redundant.
    
    The only remaining LOR between the page busying and ZFS is with ZFS range
    locks.  This should be fixed by introducing VOP_RANGELOCK to FreeBSD VFS,
    placing calls to this operation at strategic places and implementing the
    op for ZFS using the ZFS rangelock.
Comment 13 Andriy Gapon freebsd_committer 2021-10-15 09:39:56 UTC
Created attachment 228715 [details]
old ZFS VOP_LOCK work
Comment 14 Mark Johnston freebsd_committer 2021-10-15 13:12:52 UTC
(In reply to Andriy Gapon from comment #12)
> The only remaining LOR between the page busying and ZFS is with ZFS range
> locks.  This should be fixed by introducing VOP_RANGELOCK to FreeBSD VFS,
> placing calls to this operation at strategic places and implementing the
> op for ZFS using the ZFS rangelock.

Is this the bug fixed by
https://cgit.freebsd.org/src/commit/?id=66b415fb8f9a7bd9f01d063ecebc2c74cfcb10be
?
Comment 15 Andriy Gapon freebsd_committer 2021-10-18 06:14:48 UTC
(In reply to Mark Johnston from comment #14)
I regret not putting more detailed descriptions for several things in that commit message because now, unfortunately, I cannot recall those details.
Your fix very much sounds like it could be it, but I cannot tell for sure if I had the same thing in mind almost a decade ago.
Comment 16 Mark Johnston freebsd_committer 2021-10-18 21:24:39 UTC
Created attachment 228813 [details]
first attempt

Dave, would you be able to try to reproduce the deadlock with this patch applied?  It should apply to and work with any of releng/13.0, stable/13 or main, though I only tested main.
Comment 17 Mark Johnston freebsd_committer 2021-10-19 01:00:17 UTC
Created attachment 228815 [details]
second attempt

Fix a lock leak via VOP_RECLAIM