zfs operations such as rollback or destroy deadlock FreeBSD. Subsequent commands such as `mount -p` or `bectl list` also hang. Writing data as files still works. dmesg, syslog etc. are all empty.

## environment

Tried under "default" 13.0-RELEASE-p3 zfs, and also under "kmod" zfs-2.1.99-1 zfs-kmod-v2021073000-zfs_7eebcd2be.

I will build CURRENT with debug, re-try this, and report back.

## pools

- embiggen/koans dataset (zpool of 4-drive striped mirrors)
- envy (nvme zpool)

```
  pool: embiggen
 state: ONLINE
  scan: scrub repaired 0B in 02:42:41 with 0 errors on Thu Aug 26 18:22:44 2021
config:

	NAME          STATE     READ WRITE CKSUM
	embiggen      ONLINE       0     0     0
	  mirror-0    ONLINE       0     0     0
	    gpt/zfs0  ONLINE       0     0     0
	    gpt/zfs1  ONLINE       0     0     0
	  mirror-1    ONLINE       0     0     0
	    gpt/zfs2  ONLINE       0     0     0
	    gpt/zfs3  ONLINE       0     0     0

errors: No known data errors

  pool: envy
 state: ONLINE
  scan: scrub repaired 0B in 00:07:20 with 0 errors on Thu Aug 26 15:47:42 2021
config:

	NAME        STATE     READ WRITE CKSUM
	envy        ONLINE       0     0     0
	  gpt/envy  ONLINE       0     0     0

errors: No known data errors
```

## default 13.0-RELEASE zfs

- the parent process is not a zombie process
- not writing a coredump
- can't be attached to from lldb etc.
- won't respond to `kill -CONT ...`

```
$ top -SjwbHzPp 47255
last pid: 20225;  load averages: 0.78, 0.81, 0.70  up 2+02:11:14  14:14:38
3744 threads:  10 running, 3693 sleeping, 6 stopped, 35 waiting
CPU 0:  2.6% user,  0.8% nice,  1.0% system,  0.0% interrupt, 95.6% idle
CPU 1:  3.0% user,  0.9% nice,  1.1% system,  0.6% interrupt, 94.3% idle
CPU 2:  3.2% user,  1.0% nice,  1.2% system,  0.0% interrupt, 94.6% idle
CPU 3:  3.3% user,  1.0% nice,  1.2% system,  0.0% interrupt, 94.6% idle
CPU 4:  3.4% user,  1.0% nice,  1.2% system,  0.1% interrupt, 94.4% idle
CPU 5:  3.5% user,  1.0% nice,  1.2% system,  0.1% interrupt, 94.3% idle
CPU 6:  3.5% user,  1.0% nice,  1.3% system,  1.1% interrupt, 93.0% idle
CPU 7:  3.4% user,  1.0% nice,  1.2% system,  0.0% interrupt, 94.3% idle
Mem: 2489M Active, 20G Inact, 472M Laundry, 66G Wired, 8074K Buf, 35G Free
ARC: 51G Total, 34G MFU, 9441M MRU, 15M Anon, 518M Header, 7401M Other
     36G Compressed, 69G Uncompressed, 1.93:1 Ratio
Swap: 252G Total, 252G Free

  PID  JID USERNAME  PRI NICE   SIZE    RES   SWAP STATE   C   TIME    WCPU COMMAND
47255    0 dch        20    0  1316M   111M     0B STOP    4   0:00   0.00% beam.smp{10_dirty_io_sch}
47255    0 dch        20    0  1316M   111M     0B STOP    4   0:00   0.00% beam.smp{sys_sig_dispatc}

$ ps augx -p 47255
USER    PID %CPU %MEM     VSZ    RSS TT  STAT STARTED     TIME COMMAND
dch   47255  0.0  0.1 1347516 113316  1  TX   13:57    0:03.48 /usr/local/lib/erlang24/erts-12.0.3/bin/beam.smp -- -root /usr/local/lib/erlang24 -progname erl

$ sudo procstat -kk 47255
  PID    TID COMM     TDNAME          KSTACK
47255 574625 beam.smp sys_sig_dispatc mi_switch+0xc1 thread_suspend_switch+0xc0 thread_single+0x69c exit1+0xc1 sys_sys_exit+0xd amd64_syscall+0x10c fast_syscall_common+0xf8
47255 574653 beam.smp 10_dirty_io_sch mi_switch+0xc1 _sleep+0x1cb rms_rlock_fallback+0x90 zfs_lookup+0x7e zfs_freebsd_cachedlookup+0x6b vfs_cache_lookup+0xad lookup+0x68c namei+0x487 kern_statat+0xcf sys_fstatat+0x2f amd64_syscall+0x10c fast_syscall_common+0xf8
```

This situation repeats after reboot, and each time shows a hanging zfs rollback or similar command:

```
/sbin/zfs rollback -r embiggen/...@pristine

18990 168480 zfs - mi_switch+0xc1 _vm_page_busy_sleep+0x100 vm_page_sleep_if_busy+0x28 vm_object_page_remove+0xdf vn_pages_remove+0x4c zfs_rezget+0x35 zfs_resume_fs+0x258 zfs_ioc_rollback+0x158 zfsdev_ioctl_common+0x4e3 zfsdev_ioctl+0x143 devfs_ioctl+0xc7 vn_ioctl+0x1a4 devfs_ioctl_f+0x1e kern_ioctl+0x26d sys_ioctl+0xf6 amd64_syscall+0x10c fast_syscall_common+0xf8
```
## zfs-kmod-v2021073000-zfs_7eebcd2be

```
sudo zfs destroy -vrRf 'embiggen/koans'
will destroy embiggen/koans/cache/koan-ci/6253a9f1f8df5d292b0c675fafde7351c89d4e76587b8de7d7b76e699acdccf6
load: 0.29  cmd: sudo 95533 [select] 1104.21r 0.00u 0.01s 0% 8120k
mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _cv_wait_sig+0xe4 seltdwait+0xa9 kern_poll+0x51b sys_ppoll+0x6f amd64_syscall+0x10c fast_syscall_common+0xf8
^C
load: 0.44  cmd: sudo 95533 [select] 2002.06r 0.00u 0.01s 0% 8120k
mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _cv_wait_sig+0xe4 seltdwait+0xa9 kern_poll+0x51b sys_ppoll+0x6f amd64_syscall+0x10c fast_syscall_common+0xf8
load: 0.48  cmd: sudo 95533 [select] 3931.90r 0.00u 0.01s 0% 8120k
mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _cv_wait_sig+0xe4 seltdwait+0xa9 kern_poll+0x51b sys_ppoll+0x6f amd64_syscall+0x10c fast_syscall_common+0xf8
```

## subsequent commands hang

```
# bectl list
^C
load: 0.46  cmd: bectl 97606 [zfs teardown] 1221.12r 0.00u 0.01s 0% 8056k
mi_switch+0xc1 _sleep+0x1cb rms_rlock_fallback+0x90 zfs_statfs+0x32 kern_getfsstat+0x202 sys_getfsstat+0x22 amd64_syscall+0x10c fast_syscall_common+0xf8
```

## zpool events -vf

From the last "sync event" prior to the hanging destroy:

```
Sep 2 2021 08:39:43.644450230 sysevent.fs.zfs.config_sync
        version = 0x0
        class = "sysevent.fs.zfs.config_sync"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        time = 0x61308dcf 0x266987b6
        eid = 0x9

Sep 2 2021 08:42:38.680064715 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/0e6074eb51ffe7c5d478cb154327221d0355248e/%rollback"
        history_internal_str = "parent=0e6074eb51ffe7c5d478cb154327221d0355248e"
        history_internal_name = "clone swap"
        history_dsid = 0x1ea
        history_txg = 0x1019819
        history_time = 0x61308e7e
        time = 0x61308e7e 0x2888f6cb
        eid = 0xa

Sep 2 2021 08:42:38.698068059 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/0e6074eb51ffe7c5d478cb154327221d0355248e/%rollback"
        history_internal_str = "(bptree, mintxg=16881555)"
        history_internal_name = "destroy"
        history_dsid = 0x1ea
        history_txg = 0x1019819
        history_time = 0x61308e7e
        time = 0x61308e7e 0x299bac5b
        eid = 0xb

Sep 2 2021 08:42:39.009063859 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/0e6074eb51ffe7c5d478cb154327221d0355248e/%rollback"
        history_internal_str = "parent=0e6074eb51ffe7c5d478cb154327221d0355248e"
        history_internal_name = "clone swap"
        history_dsid = 0x1f8
        history_txg = 0x101981e
        history_time = 0x61308e7f
        time = 0x61308e7f 0x8a4db3
        eid = 0xc

Sep 2 2021 08:42:39.011064352 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/0e6074eb51ffe7c5d478cb154327221d0355248e/%rollback"
        history_internal_str = "(bptree, mintxg=16881555)"
        history_internal_name = "destroy"
        history_dsid = 0x1f8
        history_txg = 0x101981e
        history_time = 0x61308e7f
        time = 0x61308e7f 0xa8d420
        eid = 0xd

Sep 2 2021 08:42:41.355063329 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/cache/h2o/338ff637ff357ba38882e44d01251c511523ab67f2be5827b25d9de423544eb3@c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8"
        history_internal_str = " "
        history_internal_name = "snapshot"
        history_dsid = 0x3b4
        history_txg = 0x101981f
        history_time = 0x61308e81
        time = 0x61308e81 0x1529d621
        eid = 0xe

Sep 2 2021 08:42:41.496063945 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8"
        history_internal_str = "origin=embiggen/koans/templates/12.2-RELEASE@pristine (35650)"
        history_internal_name = "clone"
        history_dsid = 0x1ea
        history_txg = 0x1019820
        history_time = 0x61308e81
        time = 0x61308e81 0x1d9155c9
        eid = 0xf

Sep 2 2021 08:42:41.648063789 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8@pristine"
        history_internal_str = " "
        history_internal_name = "snapshot"
        history_dsid = 0x328
        history_txg = 0x1019821
        history_time = 0x61308e81
        time = 0x61308e81 0x26a0ab2d
        eid = 0x10

Sep 2 2021 08:42:41.923063860 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "parent=c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8"
        history_internal_name = "clone swap"
        history_dsid = 0x27d
        history_txg = 0x1019826
        history_time = 0x61308e81
        time = 0x61308e81 0x3704d634
        eid = 0x11

Sep 2 2021 08:42:41.926063583 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "(bptree, mintxg=16881697)"
        history_internal_name = "destroy"
        history_dsid = 0x27d
        history_txg = 0x1019826
        history_time = 0x61308e81
        time = 0x61308e81 0x37329bdf
        eid = 0x12

Sep 2 2021 08:42:44.748063744 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "parent=c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8"
        history_internal_name = "clone swap"
        history_dsid = 0x3bb
        history_txg = 0x101982b
        history_time = 0x61308e84
        time = 0x61308e84 0x2c968c00
        eid = 0x13

Sep 2 2021 08:42:44.751063792 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "(bptree, mintxg=16881697)"
        history_internal_name = "destroy"
        history_dsid = 0x3bb
        history_txg = 0x101982b
        history_time = 0x61308e84
        time = 0x61308e84 0x2cc452f0
        eid = 0x14

Sep 2 2021 08:42:44.968063430 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "parent=c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8"
        history_internal_name = "clone swap"
        history_dsid = 0x71
        history_txg = 0x1019830
        history_time = 0x61308e84
        time = 0x61308e84 0x39b379c6
        eid = 0x15

Sep 2 2021 08:42:44.970063949 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "embiggen"
        pool_guid = 0xb2c159b5522179d6
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "wintermute.skunkwerks.at"
        history_dsname = "embiggen/koans/clones/h2o/c7ee4afccbd2d1c3dfb3310d5afbbd7b4df26bf8/%rollback"
        history_internal_str = "(bptree, mintxg=16881697)"
        history_internal_name = "destroy"
        history_dsid = 0x71
        history_txg = 0x1019830
        history_time = 0x61308e84
        time = 0x61308e84 0x39d2004d
        eid = 0x16
```

and then nothing for 30 minutes.
In a simpler context I do not see the problem for destroy (so: compare/contrast with your context).

Context:

```
# zpool status
  pool: zopt0
 state: ONLINE
  scan: scrub repaired 0B in 00:01:35 with 0 errors on Sat Jun 19 18:23:07 2021
config:

	NAME        STATE     READ WRITE CKSUM
	zopt0       ONLINE       0     0     0
	  nda0p3    ONLINE       0     0     0

errors: No known data errors

# uname -apKU
FreeBSD CA72_16Gp_ZFS 13.0-RELEASE-p4 FreeBSD 13.0-RELEASE-p4 #4 releng/13.0-n244760-940681634ee1-dirty: Mon Aug 30 11:35:45 PDT 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/13_0R-CA72-nodbg-clang/usr/13_0R-src/arm64.aarch64/sys/GENERIC-NODBG-CA72  arm64 aarch64 1300139 1300139
```

Destroy related activity (of prior cross-build materials):

```
# zfs list | grep amd64
zopt0/BUILDs/13S-amd64-nodbg-clang      7.72G   167G  7.72G  /usr/obj/BUILDs/13S-amd64-nodbg-clang
zopt0/BUILDs/13_0R-amd64-nodbg-clang    7.81G   167G  7.81G  /usr/obj/BUILDs/13_0R-amd64-nodbg-clang
zopt0/BUILDs/main-amd64-nodbg-clang     7.98G   167G  7.98G  /usr/obj/BUILDs/main-amd64-nodbg-clang
zopt0/DESTDIRs/13_0R-amd64-poud           96K   167G    96K  /usr/obj/DESTDIRs/13_0R-amd64-poud
zopt0/DESTDIRs/main-amd64-poud            96K   167G    96K  /usr/obj/DESTDIRs/main-amd64-poud

# zfs destroy -vrRf zopt0/DESTDIRs/main-amd64-poud
will destroy zopt0/DESTDIRs/main-amd64-poud

# zfs destroy -vrRf zopt0/DESTDIRs/13_0R-amd64-poud
will destroy zopt0/DESTDIRs/13_0R-amd64-poud

# zfs destroy -vrRf zopt0/BUILDs/main-amd64-nodbg-clang
will destroy zopt0/BUILDs/main-amd64-nodbg-clang

# zfs destroy -vrRf zopt0/BUILDs/13_0R-amd64-nodbg-clang
will destroy zopt0/BUILDs/13_0R-amd64-nodbg-clang

# zfs destroy -vrRf zopt0/BUILDs/13S-amd64-nodbg-clang
will destroy zopt0/BUILDs/13S-amd64-nodbg-clang

# zfs list | grep amd64
#
```

(I added the blank lines to make reading easier.)
Please attach full procstat -kk -a output.
Further investigation suggests this is related to devfs mounts as well, i.e. it may not be directly zfs-related, or involve zfs at all. I can trigger it by mounting ~90+ devfs instances on cloned zfs filesystems and then trying just one more clone: *boom*. I am almost ready with a simplified test case.
Created attachment 227959 [details] procstat-kk-a.log
(In reply to Dave Cottlehuber from comment #5)

So there is effectively an LOR here:

```
   56 152682 zfs              -   mi_switch+0xc1 _vm_page_busy_sleep+0x100 vm_page_sleep_if_busy+0x28 vm_object_page_remove+0xdf vn_pages_remove+0x4c zfs_rezget+0x35 zfs_resume_fs+0x258 zfs_ioc_rollback+0x158 zfsdev_ioctl_common+0x4e3 zfsdev_ioctl+0x143 devfs_ioctl+0xc7 vn_ioctl+0x1a4 devfs_ioctl_f+0x1e kern_ioctl+0x26d sys_ioctl+0xf6 amd64_syscall+0x10c fast_syscall_common+0xf8

  445 168579 ping             -   mi_switch+0xc1 _sleep+0x1cb rms_rlock_fallback+0x90 zfs_freebsd_getpages+0x52 vnode_pager_getpages+0x41 vm_pager_get_pages+0x22 vm_fault+0x49a vm_fault_trap+0x6d trap_pfault+0x1f8 trap+0x3fd calltrap+0x8
```

The page array passed to zfs_getpages() is busied upon entry, and then we do ZFS_ENTER(zfsvfs). But zfs_rezget() is called with this lock held and then tries to busy vnode pages.
I am not sure how best to fix this.

To elaborate a bit more: the deadlock occurs because a rollback does a suspend/resume of the target fs. This involves taking the teardown write lock; one thing we do with the lock held is call zfs_rezget() on all vnodes associated with the filesystem, which among other things throws away all data cached in the page cache. This requires pages to be busied with the ZFS teardown write lock held, so I am inclined to think that zfs_freebsd_getpages() should be responsible for breaking the deadlock, as it does in https://cgit.freebsd.org/src/commit/?id=cd32b4f5b79c97b293f7be3fe9ddfc9024f7d734 .

zfs_freebsd_getpages() could perhaps trylock and upon failure return some EAGAIN-like status to ask the fault handler to retry, but I don't see a way to do that: vm_fault_getpages() squashes the error and does not allow the pager to return KERN_RESOURCE_SHORTAGE.

Alternately, zfs_freebsd_getpages() could perhaps wire and unbusy the page upon a trylock failure. Once it successfully acquires the teardown read lock, it could re-look-up the fault page and compare, or re-insert the wired page if necessary.

OTOH I cannot see how this is handled on Linux. In particular, I do not see how their zfs_rezget() invalidates the page cache.
Another potential solution -- and I have/had a prototype for it (for the older FreeBSD ZFS) -- is to have custom lock/unlock vops in ZFS and to enter the teardown lock before acquiring the vnode lock. This establishes a lock order friendly to zfs_rezget(). There were some tricky details to get that approach working, but I think that I got it working in the end. The new order between the teardown and the vnode locks also allowed simplifying some code in other vops where currently we have to drop and re-acquire the teardown lock each time we need to get the vnode lock.
avg@, can you share that prior work at least? It may be helpful.
(In reply to Andriy Gapon from comment #8)

Sorry, I missed this comment. I can try to help finish such a patch if it would be useful. I think hacking zfs_getpages() to replace pages when a trylock of the teardown lock fails also works, but your approach has other benefits.
Created attachment 228714 [details]
old ZFS VOP_LOCK work

Here is the best I can share immediately. This is from an old branch. I am not sure how soon I may have time to try to rebase it. But I hope you can find something useful in it.
And the commit message:

```
Author:     Andriy Gapon <avg@icyb.net.ua>
AuthorDate: Fri Nov 16 12:11:45 2012 +0200

    zfs: introduce zfs_freebsd_lock and zfs_freebsd_unlock...

    These vnode operations acquire and release z_teardown_lock in addition
    to the normal operations on the vnode lock.

    This is a half-step towards fixing lock order problems between VM page
    busying "locks" and ZFS internal locks. In particular, now
    z_teardown_lock is always acquired prior to page busying (the same as
    the vnode lock). Ideally I wanted z_teardown_lock to be acquired before
    the vnode lock, but this is not possible to implement using the
    VOP_LOCK1 approach.

    An additional effect of this change is that now all ZFS vnode ops are
    guaranteed to be protected with z_teardown_lock. Previously some
    FreeBSD-specific operations accessed znodes and zfsvfs without calling
    ZFS_ENTER first. Consequently, after this change the ZFS_ENTER calls
    in the vnode ops have become redundant. They are still needed in ZFS
    vfs ops.

    This change should fix a deadlock between vn_pages_remove /
    vnode_pager_setsize called from zfs_rezget with z_teardown_lock held
    and a putpages operation blocked on z_teardown_lock while entering
    zfs_write.

    This change has also made z_teardown_inactive_lock redundant.

    The only remaining LOR between the page busying and ZFS is with ZFS
    range locks. This should be fixed by introducing VOP_RANGELOCK to
    FreeBSD VFS, placing calls to this operation at strategic places and
    implementing the op for ZFS using the ZFS rangelock.
```
Created attachment 228715 [details] old ZFS VOP_LOCK work
(In reply to Andriy Gapon from comment #12)

> The only remaining LOR between the page busying and ZFS is with ZFS range
> locks. This should be fixed by introducing VOP_RANGELOCK to FreeBSD VFS,
> placing calls to this operation at strategic places and implementing the
> op for ZFS using the ZFS rangelock.

Is this the bug fixed by https://cgit.freebsd.org/src/commit/?id=66b415fb8f9a7bd9f01d063ecebc2c74cfcb10be ?
(In reply to Mark Johnston from comment #14)

I regret not putting more detailed descriptions of several things in that commit message because now, unfortunately, I cannot recall those details. Your fix very much sounds like it could be it, but I cannot tell for sure whether I had the same thing in mind almost a decade ago.
Created attachment 228813 [details]
first attempt

Dave, would you be able to try to reproduce the deadlock with this patch applied? It should apply to and work with any of releng/13.0, stable/13 or main, though I only tested main.
Created attachment 228815 [details]
second attempt

Fix a lock leak via VOP_RECLAIM.
Some stress testing revealed a couple of problems with the original approach of overriding vop_lock:

1. rms read locks don't permit recursion. This isn't really obvious, since nothing asserts it; rms locks would need WITNESS support for the recursion to be caught.

2. Even if reader recursion were permitted, deadlocks can occur when threads lock multiple vnodes (as can happen during lookup). In particular, thread A may be blocked on reader thread B while trying to acquire the teardown write lock, and thread B may be blocked on a vnode lock owned by thread C, which is in turn blocked from acquiring the teardown read lock by thread A.

So we'd need some deadlock avoidance mechanism there, or a different approach.
(In reply to Mark Johnston from comment #18)

Now that you found this I also belatedly recalled that I ran into the same kind of problem... I think that getting the teardown lock before the vnode locks would help with a problem like that. Unfortunately, that's not possible (?) to arrange via VOP_LOCK because of the interlock, which is not sleepable. I was leaning towards the idea that ZFS should somehow hook the teardown lock into vn_start_write() or something like that. But for ZFS we would also need vn_start_read() as well. And it's not easy to sprinkle such calls in all places where they are needed...

Just in case, here is the 3-way deadlock that I saw:

```
(kgdb) tid 102522
(kgdb) bt
#0  sched_switch (td=0xfffff802abd53580, newtd=0xfffff80008d16580, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2005
#1  0xffffffff806a5091 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:438
#2  0xffffffff806e252c in sleepq_switch (wchan=0xfffffe000ec025c8, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:611
#3  0xffffffff806e23d2 in sleepq_wait (wchan=0xfffffe000ec025c8, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:690
#4  0xffffffff8063ffa3 in _cv_wait (cvp=0xfffffe000ec025c8, lock=<optimized out>) at /usr/src/sys/kern/kern_condvar.c:144
#5  0xffffffff8035bc6b in rrw_enter_read_impl (rrl=0xfffffe000ec025a8, prio=0, tag=0xfffff8065044cb10, try=0) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/rrwlock.c:201
#6  0xffffffff8035bbb3 in rrw_enter_read (rrl=<unavailable>, tag=<unavailable>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/rrwlock.c:223
#7  0xffffffff8035c2ff in rrm_enter_read (rrl=<optimized out>, tag=<unavailable>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/rrwlock.c:429
#8  0xffffffff8035c2b1 in rrm_enter (rrl=<unavailable>, rw=<optimized out>, tag=<unavailable>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/rrwlock.c:405
#9  0xffffffff803b81db in zfs_freebsd_lock (ap=0xfffffe090e81d4f8) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:6010
#10 0xffffffff8095608d in VOP_LOCK1_APV (vop=<optimized out>, a=0xfffffe090e81d4f8) at vnode_if.c:2087
#11 0xffffffff8075f8ab in VOP_LOCK1 (vp=<optimized out>, flags=<optimized out>, file=<unavailable>, line=<unavailable>) at ./vnode_if.h:859
#12 0xffffffff8075e755 in _vn_lock (vp=<optimized out>, flags=524544, file=0xffffffff80acfa01 "/usr/src/sys/kern/vfs_subr.c", line=2768) at /usr/src/sys/kern/vfs_vnops.c:1555
#13 0xffffffff8074ef5f in vputx (vp=0xfffff8065044cb10, func=1) at /usr/src/sys/kern/vfs_subr.c:2768
#14 0xffffffff8074edbe in vrele (vp=<unavailable>) at /usr/src/sys/kern/vfs_subr.c:2804
#15 0xffffffff8073a901 in vn_vptocnp (vp=0xfffffe090e81d628, cred=0xfffff800234fe000, buf=<optimized out>, buflen=<optimized out>) at /usr/src/sys/kern/vfs_cache.c:2232
#16 0xffffffff8073a384 in vn_fullpath1 (td=0xfffff802abd53580, vp=0xfffff806cd2063b0, rdir=0xfffff8001bd8bb10, buf=0xfffff801dec4f800 "\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336"..., retbuf=0xfffffe090e81d770, buflen=1007) at /usr/src/sys/kern/vfs_cache.c:2335
#17 0xffffffff8073a62f in vn_fullpath (td=0xfffff802abd53580, vn=0xfffff8060b24cce8, retbuf=0xfffffe090e81d770, freebuf=0xfffffe090e81d778) at /usr/src/sys/kern/vfs_cache.c:2164
#18 0xffffffff80760af4 in vn_fill_kinfo_vnode (vp=0xfffff8060b24cce8, kif=0xfffff80232b49018) at /usr/src/sys/kern/vfs_vnops.c:2353
#19 0xffffffff8064f27c in export_vnode_to_kinfo (vp=0xfffff8060b24cce8, fd=-1, fflags=1, kif=0xfffff80232b49018, flags=1) at /usr/src/sys/kern/kern_descrip.c:3408
#20 0xffffffff8064e9b4 in export_vnode_to_sb (vp=0xfffff8060b24cce8, fd=-1, fflags=1, efbuf=0xfffff80232b49000) at /usr/src/sys/kern/kern_descrip.c:3474
#21 0xffffffff8064e84c in kern_proc_filedesc_out (p=<optimized out>, sb=0xfffffe090e81d888, maxlen=-1, flags=1) at /usr/src/sys/kern/kern_descrip.c:3537
#22 0xffffffff8064f87d in sysctl_kern_proc_filedesc (oidp=<optimized out>, arg1=<optimized out>, arg2=<optimized out>, req=<optimized out>) at /usr/src/sys/kern/kern_descrip.c:3597
#23 0xffffffff806a7d6f in sysctl_root_handler_locked (oid=0xffffffff80d93760 <sysctl___kern_proc_filedesc>, arg1=0xfffffe090e81da9c, arg2=1, req=0xfffffe090e81d9d0, tracker=0xfffffe090e81d940) at /usr/src/sys/kern/kern_sysctl.c:165
#24 0xffffffff806a7545 in sysctl_root (oidp=<optimized out>, arg1=<unavailable>, arg2=1, req=<optimized out>) at /usr/src/sys/kern/kern_sysctl.c:2027
#25 0xffffffff806a7a38 in userland_sysctl (td=<optimized out>, name=0xfffffe090e81da90, namelen=4, old=0x0, oldlenp=<optimized out>, inkernel=<optimized out>, new=<optimized out>, newlen=<optimized out>, retval=<unavailable>, flags=0) at /usr/src/sys/kern/kern_sysctl.c:2122
#26 0xffffffff806a78bf in sys___sysctl (td=0xfffff802abd53580, uap=0xfffff802abd53930) at /usr/src/sys/kern/kern_sysctl.c:2057
#27 0xffffffff808fdecb in syscallenter (td=0xfffff802abd53580) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:132
#28 0xffffffff808fda87 in amd64_syscall (td=0xfffff802abd53580, traced=0) at /usr/src/sys/amd64/amd64/trap.c:915

(kgdb) p rrl->rr_writer->td_tid
$2 = 100351

(kgdb) tid 100351
(kgdb) bt
#0  sched_switch (td=0xfffff80025288000, newtd=0xfffff806ff236580, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2005
#1  0xffffffff806a5091 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:438
#2  0xffffffff806e252c in sleepq_switch (wchan=0xfffffe000ec032d8, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:611
#3  0xffffffff806e23d2 in sleepq_wait (wchan=0xfffffe000ec032d8, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:690
#4  0xffffffff8063ffa3 in _cv_wait (cvp=0xfffffe000ec032d8, lock=<optimized out>) at /usr/src/sys/kern/kern_condvar.c:144
#5  0xffffffff8035bde5 in rrw_enter_write (rrl=0xfffffe000ec032b8) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/rrwlock.c:253
#6  0xffffffff8035c329 in rrm_enter_write (rrl=0xfffffe000ec020e8) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/rrwlock.c:438
#7  0xffffffff8035c2b8 in rrm_enter (rrl=<unavailable>, rw=<unavailable>, tag=<unavailable>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/rrwlock.c:407
#8  0xffffffff803b364b in zfs_mount (vfsp=0xfffff80043895000) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:1715
#9  0xffffffff80747c99 in vfs_domount_update (td=<optimized out>, vp=0xfffff800438b3588, fsflags=268505368, optlist=0xfffffe090e404a80) at /usr/src/sys/kern/vfs_mount.c:976
#10 0xffffffff807454fb in vfs_domount (td=0xfffff80025288000, fstype=<optimized out>, fspath=0xfffff804fe9e41a0 "/usr/obj", fsflags=<optimized out>, optlist=0xfffffe090e404a80) at /usr/src/sys/kern/vfs_mount.c:1132
#11 0xffffffff80744c52 in vfs_donmount (td=0xfffff80025288000, fsflags=<optimized out>, fsoptions=0xfffff8059de54900) at /usr/src/sys/kern/vfs_mount.c:687
#12 0xffffffff80744608 in sys_nmount (td=0xfffff80025288000, uap=<optimized out>) at /usr/src/sys/kern/vfs_mount.c:421
#13 0xffffffff808fdecb in syscallenter (td=0xfffff80025288000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:132
#14 0xffffffff808fda87 in amd64_syscall (td=0xfffff80025288000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:915

(kgdb) tid 102044
(kgdb) bt
#0  sched_switch (td=0xfffff806f0cb0580, newtd=0xfffff80008d17000, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2005
#1  0xffffffff806a5091 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:438
#2  0xffffffff806e252c in sleepq_switch (wchan=0xfffff8065044cb78, pri=96) at /usr/src/sys/kern/subr_sleepqueue.c:611
#3  0xffffffff806e23d2 in sleepq_wait (wchan=0xfffff8065044cb78, pri=96) at /usr/src/sys/kern/subr_sleepqueue.c:690
#4  0xffffffff8067311f in sleeplk (lk=0xfffff8065044cb78, flags=2121728, ilk=<optimized out>, wmesg=0xffffffff80aae927 "zfs", pri=96, timo=51, queue=<optimized out>) at /usr/src/sys/kern/kern_lock.c:288
#5  0xffffffff80671c05 in __lockmgr_args (lk=0xfffff8065044cb78, flags=2121728, ilk=0xfffff8065044cba8, wmesg=<optimized out>, pri=<optimized out>, timo=<optimized out>, file=<optimized out>, line=<optimized out>) at /usr/src/sys/kern/kern_lock.c:873
#6  0xffffffff806711c5 in lockmgr_lock_fast_path (lk=0xfffff8065044cb78, flags=2121728, ilk=0xfffff8065044cba8, file=0xffffffff80acfa01 "/usr/src/sys/kern/vfs_subr.c", line=2606) at /usr/src/sys/kern/kern_lock.c:605
#7  0xffffffff8073d939 in vop_stdlock (ap=<optimized out>) at /usr/src/sys/kern/vfs_default.c:517
#8  0xffffffff803b81e8 in zfs_freebsd_lock (ap=0xfffffe090e8a9588) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:5987
#9  0xffffffff8095608d in VOP_LOCK1_APV (vop=<optimized out>, a=0xfffffe090e8a9588) at vnode_if.c:2087
#10 0xffffffff8075f8ab in VOP_LOCK1 (vp=<optimized out>, flags=<optimized out>, file=<unavailable>, line=<unavailable>) at ./vnode_if.h:859
#11 0xffffffff8075e755 in _vn_lock (vp=<optimized out>, flags=2121728, file=0xffffffff80acfa01 "/usr/src/sys/kern/vfs_subr.c", line=2606) at /usr/src/sys/kern/vfs_vnops.c:1555
#12 0xffffffff8074e5af in vget (vp=0xfffff8065044cb10, flags=<optimized out>, td=0xfffff806f0cb0580) at /usr/src/sys/kern/vfs_subr.c:2606
#13 0xffffffff8073719c in cache_lookup (dvp=0xfffff806cd2063b0, vpp=0xfffffe090e8a9a18, cnp=0xfffffe090e8a9a40, tsp=0x0, ticksp=<optimized out>) at /usr/src/sys/kern/vfs_cache.c:1321
#14 0xffffffff8073a0a4 in vfs_cache_lookup (ap=<optimized out>) at /usr/src/sys/kern/vfs_cache.c:2067
#15 0xffffffff803b8391 in zfs_cache_lookup (ap=<unavailable>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:4956
#16 0xffffffff8095275a in VOP_LOOKUP_APV (vop=<optimized out>, a=0xfffffe090e8a9720) at vnode_if.c:127
#17 0xffffffff80743779 in VOP_LOOKUP (dvp=<optimized out>, vpp=<optimized out>, cnp=<unavailable>) at ./vnode_if.h:54
#18 0xffffffff80742bce in lookup (ndp=0xfffffe090e8a99b8) at /usr/src/sys/kern/vfs_lookup.c:891
#19 0xffffffff807424a7 in namei (ndp=0xfffffe090e8a99b8) at /usr/src/sys/kern/vfs_lookup.c:453
#20 0xffffffff8075e046 in vn_open_cred (ndp=0xfffffe090e8a99b8, flagp=0xfffffe090e8a9ac4, cmode=0, vn_open_flags=0, cred=0xfffff800234fe000, fp=0xfffff8072ca41b90) at /usr/src/sys/kern/vfs_vnops.c:277
#21 0xffffffff8075de4f in vn_open (ndp=<unavailable>, flagp=<unavailable>, cmode=<unavailable>, fp=<unavailable>) at /usr/src/sys/kern/vfs_vnops.c:180
#22 0xffffffff80757b2f in kern_openat (td=0xfffff806f0cb0580, fd=-100, path=0x800de1a80 <error: Cannot access memory at address 0x800de1a80>, pathseg=UIO_USERSPACE, flags=1, mode=<optimized out>) at /usr/src/sys/kern/vfs_syscalls.c:1082
#23 0xffffffff80757d2b in sys_openat (td=0xfffff806f0cb0580, uap=0xfffff806f0cb0930) at /usr/src/sys/kern/vfs_syscalls.c:1030
#24 0xffffffff808fdecb in syscallenter (td=0xfffff806f0cb0580) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:132
#25 0xffffffff808fda87 in amd64_syscall (td=0xfffff806f0cb0580, traced=0) at /usr/src/sys/amd64/amd64/trap.c:915
```

Summary:

1. Thread 102044 is doing a lookup; dvp must be already locked and thus the filesystem is read-locked. The thread is blocked on the child vp lock.
2. Thread 100351 is doing mount -u and wants to write-lock the filesystem; it is blocked by thread 102044.
3. Thread 102522 does vrele on the child vp and needs to lock it. The thread has already got the vnode lock and is blocked by thread 100351.
(In reply to Andriy Gapon from comment #19) Yeah, I was also wondering about changing the lock order. I think that would fix the deadlock, but this is getting kind of hairy. Maybe we could busy the mountpoint while sleeping on the teardown lock, but I'm not sure.

Taking a step back: zfs_rezget() is triggering the deadlock by busying page cache pages, which it does so that it can purge cached data which would otherwise become stale when the dataset is resumed. But it really only needs to purge valid pages: invalid pages won't be mapped, and ZFS marks file pages as valid while holding the teardown lock. The deadlock happens when zfs_rezget() is purging _invalid_ pages that getpages is supposed to fill. So perhaps zfs_rezget() can simply ignore invalid pages.

I tried implementing this and it fixes the deadlock in my stress test, which simply runs buildkernel in a loop while simultaneously rolling back the dataset in a loop. This test also uncovered some UAFs, btw:

It's maybe a bit too hacky, since it means that we check the valid state of a page without busying it, with only the VM object lock held. This is OK for now, at least: to mark a page valid, the page must be busied, but the object lock _and_ busy lock are needed to mark a page invalid. So if vm_object_page_remove() encounters an invalid page, there is no guarantee that it won't later become valid. For ZFS I think it's safe to assume that vnode pages only transition invalid->valid under the teardown lock, but that seems like a delicate assumption...

The only other solution I can see is to add a new VOP to lock a vnode in preparation for a getpages call. This VOP could acquire the teardown lock, so we get a consistent lock order vnode->teardown->busy, and then we don't need to deal with recursion. It's not just the page fault handler which needs it, though; at least sendfile and the image activator need it as well.
See https://reviews.freebsd.org/D32931 for an attempt at a fix.
A commit in branch main references this bug:

```
URL: https://cgit.FreeBSD.org/src/commit/?id=d28af1abf031ee87a478b37180e3f0c518caedf6

commit d28af1abf031ee87a478b37180e3f0c518caedf6
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-11-15 16:44:04 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-11-15 18:01:30 +0000

    vm: Add a mode to vm_object_page_remove() which skips invalid pages

    This will be used to break a deadlock in ZFS between the per-mountpoint
    teardown lock and page busy locks. In particular, when purging data
    from the page cache during dataset rollback, we want to avoid blocking
    on the busy state of invalid pages since the busying thread may be
    blocked on the teardown lock in zfs_getpages().

    Add a helper, vn_pages_remove_valid(), for use by filesystems. Bump
    __FreeBSD_version so that the OpenZFS port can make use of the new
    helper.

    PR:             258208
    Reviewed by:    avg, kib, sef
    Tested by:      pho (part of a larger patch)
    MFC after:      2 weeks
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D32931

 sys/kern/vfs_vnops.c | 22 ++++++++++++++++++++++
 sys/sys/vnode.h      |  2 ++
 sys/vm/vm_object.c   | 19 +++++++++++++++++++
 sys/vm/vm_object.h   |  1 +
 4 files changed, 44 insertions(+)
```
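Going by the commit message alone, the new helper presumably wraps vm_object_page_remove() with a flag requesting the new skip-invalid mode. The following is a pseudocode-level sketch in C syntax, not the actual diff: the flag name `OBJPR_VALIDONLY` and the locking details are assumptions.

```c
/*
 * Sketch only, inferred from the commit message above.  OBJPR_VALIDONLY
 * is an assumed name for the new vm_object_page_remove() mode that skips
 * (and therefore never busies) invalid pages.
 */
void
vn_pages_remove_valid(struct vnode *vp, vm_pindex_t start, vm_pindex_t end)
{
	vm_object_t object;

	if ((object = vp->v_object) == NULL)
		return;
	VM_OBJECT_WLOCK(object);
	vm_object_page_remove(object, start, end, OBJPR_VALIDONLY);
	VM_OBJECT_WUNLOCK(object);
}
```

Because invalid pages are skipped without being busied, the purge never waits on a page the fault handler owns.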
A commit in branch main references this bug:

```
URL: https://cgit.FreeBSD.org/src/commit/?id=705a6ee2b6112c3a653b2bd68f961a8b5b8071a4

commit 705a6ee2b6112c3a653b2bd68f961a8b5b8071a4
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-11-20 16:21:25 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-11-20 16:21:25 +0000

    zfs: Fix a deadlock between page busy and the teardown lock

    When rolling back a dataset, ZFS has to purge file data resident in the
    system page cache. To do this, it loops over all vnodes for the
    mountpoint and calls vn_pages_remove() to purge pages associated with
    the vnode's VM object. Each page is thus exclusively busied while the
    dataset's teardown write lock is held.

    When handling a page fault on a mapped ZFS file, FreeBSD's page fault
    handler busies newly allocated pages and then uses VOP_GETPAGES to fill
    them. The ZFS getpages VOP acquires the teardown read lock with vnode
    pages already busied. This represents a lock order reversal which can
    lead to deadlock.

    To break the deadlock, observe that zfs_rezget() need only purge those
    pages marked valid, and that pages busied by the page fault handler
    are, by definition, invalid. Furthermore, ZFS pages always transition
    from invalid to valid with the teardown lock held, and ZFS never
    creates partially valid pages. Thus, zfs_rezget() can use the new
    vn_pages_remove_valid() to skip over pages busied by the fault handler.

    PR:             258208
    Tested by:      pho
    Reviewed by:    avg, sef, kib
    MFC after:      2 weeks
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D32931

 sys/contrib/openzfs/module/os/freebsd/zfs/zfs_znode.c | 9 +++++++++
 1 file changed, 9 insertions(+)
```
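The nine-line zfs_znode.c change presumably switches zfs_rezget()'s purge from vn_pages_remove() to the new helper behind a __FreeBSD_version guard (the companion commit bumped the version for exactly this purpose). A pseudocode-level sketch in C syntax; the guard value and exact placement are assumptions, not copied from the diff.

```c
/*
 * Sketch of the zfs_rezget() change, inferred from the commit message.
 * Pages busied by the fault handler are invalid by definition, so the
 * valid-only purge skips them and the lock order reversal is avoided.
 */
#if __FreeBSD_version >= 1400042	/* assumed guard value */
	vn_pages_remove_valid(vp, 0, 0);
#else
	vn_pages_remove(vp, 0, 0);
#endif
```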
A commit in branch stable/13 references this bug:

```
URL: https://cgit.FreeBSD.org/src/commit/?id=fdd27db34802decf062339411e5f84993e733be0

commit fdd27db34802decf062339411e5f84993e733be0
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-11-15 16:44:04 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-11-29 14:09:28 +0000

    vm: Add a mode to vm_object_page_remove() which skips invalid pages

    This will be used to break a deadlock in ZFS between the per-mountpoint
    teardown lock and page busy locks. In particular, when purging data
    from the page cache during dataset rollback, we want to avoid blocking
    on the busy state of invalid pages since the busying thread may be
    blocked on the teardown lock in zfs_getpages().

    Add a helper, vn_pages_remove_valid(), for use by filesystems. Bump
    __FreeBSD_version so that the OpenZFS port can make use of the new
    helper.

    PR:             258208
    Reviewed by:    avg, kib, sef
    Tested by:      pho (part of a larger patch)
    Sponsored by:   The FreeBSD Foundation

    (cherry picked from commit d28af1abf031ee87a478b37180e3f0c518caedf6)

 sys/kern/vfs_vnops.c | 22 ++++++++++++++++++++++
 sys/sys/vnode.h      |  2 ++
 sys/vm/vm_object.c   | 19 +++++++++++++++++++
 sys/vm/vm_object.h   |  1 +
 4 files changed, 44 insertions(+)
```
A commit in branch stable/13 references this bug:

```
URL: https://cgit.FreeBSD.org/src/commit/?id=0c5f290ed90963b64b05a2098e0b11c3d9f6471f

commit 0c5f290ed90963b64b05a2098e0b11c3d9f6471f
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-11-20 16:21:25 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-12-05 18:38:35 +0000

    zfs: Fix a deadlock between page busy and the teardown lock

    When rolling back a dataset, ZFS has to purge file data resident in the
    system page cache. To do this, it loops over all vnodes for the
    mountpoint and calls vn_pages_remove() to purge pages associated with
    the vnode's VM object. Each page is thus exclusively busied while the
    dataset's teardown write lock is held.

    When handling a page fault on a mapped ZFS file, FreeBSD's page fault
    handler busies newly allocated pages and then uses VOP_GETPAGES to fill
    them. The ZFS getpages VOP acquires the teardown read lock with vnode
    pages already busied. This represents a lock order reversal which can
    lead to deadlock.

    To break the deadlock, observe that zfs_rezget() need only purge those
    pages marked valid, and that pages busied by the page fault handler
    are, by definition, invalid. Furthermore, ZFS pages always transition
    from invalid to valid with the teardown lock held, and ZFS never
    creates partially valid pages. Thus, zfs_rezget() can use the new
    vn_pages_remove_valid() to skip over pages busied by the fault handler.

    PR:             258208
    Tested by:      pho
    Reviewed by:    avg, sef, kib
    Sponsored by:   The FreeBSD Foundation

    (cherry picked from commit 705a6ee2b6112c3a653b2bd68f961a8b5b8071a4)

 sys/contrib/openzfs/module/os/freebsd/zfs/zfs_znode.c | 9 +++++++++
 1 file changed, 9 insertions(+)
```
I believe the original problem is now fixed on main and stable/13. There is a stress2 scenario which triggers the deadlock (zfs14.sh), so we now have a nice regression test too.
A commit in branch vendor/openzfs/master references this bug:

```
URL: https://cgit.FreeBSD.org/src/commit/?id=cdf74673bc7993a02264210b796a77193357966c

commit cdf74673bc7993a02264210b796a77193357966c
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-11-20 16:21:25 +0000
Commit:     Brian Behlendorf <behlendorf1@llnl.gov>
CommitDate: 2021-12-12 19:13:18 +0000

    zfs: Fix a deadlock between page busy and the teardown lock

    When rolling back a dataset, ZFS has to purge file data resident in the
    system page cache. To do this, it loops over all vnodes for the
    mountpoint and calls vn_pages_remove() to purge pages associated with
    the vnode's VM object. Each page is thus exclusively busied while the
    dataset's teardown write lock is held.

    When handling a page fault on a mapped ZFS file, FreeBSD's page fault
    handler busies newly allocated pages and then uses VOP_GETPAGES to fill
    them. The ZFS getpages VOP acquires the teardown read lock with vnode
    pages already busied. This represents a lock order reversal which can
    lead to deadlock.

    To break the deadlock, observe that zfs_rezget() need only purge those
    pages marked valid, and that pages busied by the page fault handler
    are, by definition, invalid. Furthermore, ZFS pages always transition
    from invalid to valid with the teardown lock held, and ZFS never
    creates partially valid pages. Thus, zfs_rezget() can use the new
    vn_pages_remove_valid() to skip over pages busied by the fault handler.

    PR:             258208
    Tested by:      pho
    Reviewed by:    avg, sef, kib
    MFC after:      2 weeks
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D32931

    Reviewed-by: Tony Hutter <hutter2@llnl.gov>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Alexander Motin <mav@FreeBSD.org>
    Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
    Closes #12828

 module/os/freebsd/zfs/zfs_znode.c | 9 +++++++++
 1 file changed, 9 insertions(+)
```
A commit in branch vendor/openzfs/zfs-2.1-release references this bug:

```
URL: https://cgit.FreeBSD.org/src/commit/?id=b3427b18b1c694f9dbc4673a11a6cf57c7f5c5d8

commit b3427b18b1c694f9dbc4673a11a6cf57c7f5c5d8
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-11-20 16:21:25 +0000
Commit:     Tony Hutter <hutter2@llnl.gov>
CommitDate: 2022-03-04 23:37:41 +0000

    zfs: Fix a deadlock between page busy and the teardown lock

    When rolling back a dataset, ZFS has to purge file data resident in the
    system page cache. To do this, it loops over all vnodes for the
    mountpoint and calls vn_pages_remove() to purge pages associated with
    the vnode's VM object. Each page is thus exclusively busied while the
    dataset's teardown write lock is held.

    When handling a page fault on a mapped ZFS file, FreeBSD's page fault
    handler busies newly allocated pages and then uses VOP_GETPAGES to fill
    them. The ZFS getpages VOP acquires the teardown read lock with vnode
    pages already busied. This represents a lock order reversal which can
    lead to deadlock.

    To break the deadlock, observe that zfs_rezget() need only purge those
    pages marked valid, and that pages busied by the page fault handler
    are, by definition, invalid. Furthermore, ZFS pages always transition
    from invalid to valid with the teardown lock held, and ZFS never
    creates partially valid pages. Thus, zfs_rezget() can use the new
    vn_pages_remove_valid() to skip over pages busied by the fault handler.

    PR:             258208
    Tested by:      pho
    Reviewed by:    avg, sef, kib
    MFC after:      2 weeks
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D32931

    Reviewed-by: Tony Hutter <hutter2@llnl.gov>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Alexander Motin <mav@FreeBSD.org>
    Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
    Closes #12828

 module/os/freebsd/zfs/zfs_znode.c | 9 +++++++++
 1 file changed, 9 insertions(+)
```
Please: how much of this – pre- and post-fix – might land in releng/13.1? Is releng/13.2 more likely? I do see the two-week MFC for the post-fix commit, so I'm not asking anyone to jump the gun; just curious (with draft release notes for 13.1 in mind). Thanks.
(In reply to Graham Perrin from comment #29) What do you mean by pre- and post-fixes? The commit which fixes the original bug has been in stable/13 for several months and so will be included in releng/13.1.