Bug 241083 - zfs: zpool import seems to probe all snapshots
Summary: zfs: zpool import seems to probe all snapshots
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-STABLE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Alan Somers
URL: https://reviews.freebsd.org/D22077
Keywords:
Depends on:
Blocks:
 
Reported: 2019-10-05 14:13 UTC by Martin Birgmeier
Modified: 2021-06-17 21:05 UTC
CC List: 1 user

See Also:


Attachments
stderr output (79 bytes, text/plain), 2019-10-05 20:06 UTC, Martin Birgmeier
stdout (2.36 KB, text/plain), 2019-10-05 20:07 UTC, Martin Birgmeier
Speed up "zpool import" in the presence of many zvols (1.12 KB, patch), 2019-10-05 22:47 UTC, Alan Somers

Description Martin Birgmeier 2019-10-05 14:13:34 UTC
Scenario:
- Homegrown server, 6 disks
- Each disk partitioned: ca. 3/4 for the first zpool ("hal.1"), ca. 1/5 for the second zpool, rest for misc. stuff
- The first zpool is imported at boot (from the unencrypted "3/4" partitions); it contains about 300 ZFS filesystems plus 20 volumes. The volumes are frequently partitioned (MBR, BSD, or MBR + BSD) because they are used as iSCSI disks for virtualized installations of various operating systems
- The second zpool is manually imported (when needed) from the encrypted partitions (each of the "1/5" partitions is encrypted), it contains about 20 filesystems (incl. volumes)
- Backups are kept by regularly taking recursive snapshots on both pools, about 10 backups are kept at any time
- This results in over 500 device entries in /dev/zvol (e.g., "/dev/zvol/hal.1/1/vdisks/925@backup.2019-09-21.21:59:27s2a")
- At some point, the second zpool is imported

Result:
- The import takes a long time
- Using "gstat -a", it can be observed that there is much activity on each of the devices in /dev/zvol/... during the time it takes to import the second pool

Expected result:
- Importing a pool should not look at snapshot devices but only at "normal" devices, where "normal" could be defined as not containing an "@", or as being read/write, or by trying the devices with the shortest path first (or maybe there is a better discriminator), in order to speed up the import
- Maybe (by default) it should not look at devices in /dev/zvol at all because normally one would not want to create a pool from devices in another pool
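
For scale, a quick way to count the zvol device nodes mentioned above, and how many of them belong to snapshots, is sketched below (paths as reported above; every devfs node is a character device):

# count all zvol device nodes
find /dev/zvol -type c | wc -l
# count only the nodes created for snapshots (names contain "@")
find /dev/zvol -type c -name '*@*' | wc -l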
Comment 1 Alan Somers freebsd_committer freebsd_triage 2019-10-05 16:53:06 UTC
Here are a few suggestions:

1) Did you set the cachefile property on the second pool to "none"?  If so, that's the main reason why import is taking so long, because ZFS needs to probe every geom provider.  If you set the cachefile property to default then your pool will be imported at boot time, but the import will happen quickly.

2) What is the setting of vfs.zfs.vol.recursive?  If it's non-zero, that might be contributing to your problem.

3) Are you exposing the zvols as GEOM providers or simple block devices?  I think setting vfs.zfs.vol.mode to 2 or 3 should help you.  (1=geom, 2=dev, 3=none).  Of course, ctld won't work if you set it to 3.
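
A minimal sketch of checking and adjusting the knobs mentioned in 1) to 3), using the pool and sysctl names from this report (persisting vfs.zfs.vol.mode across reboots would normally go through /etc/sysctl.conf or /boot/loader.conf):

# 1) cachefile: the empty string selects the default cache file per zpool(8)
zpool get cachefile hal.5
zpool set cachefile="" hal.5
# 2) may pools be built on top of zvols?
sysctl vfs.zfs.vol.recursive
# 3) default exposure mode for zvols: 1=geom, 2=dev, 3=none
sysctl vfs.zfs.vol.mode
sysctl vfs.zfs.vol.mode=2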
Comment 2 Martin Birgmeier 2019-10-05 18:12:09 UTC
The settings are ("hal.5" is the second pool):

[0]# zpool get cachefile hal.5
NAME   PROPERTY   VALUE      SOURCE
hal.5  cachefile  -          default
[0]# 

But I guess that, since I usually keep it exported when unused, any traces of it are cleared from the cache, leading to a re-scan of all devices on the next import?

[0]# sysctl vfs.zfs.vol.recursive
vfs.zfs.vol.recursive: 0
[0]# 

[0]# sysctl vfs.zfs.vol.mode
vfs.zfs.vol.mode: 1
[0]# 

I did not know about this setting at all. Where can I read about it? 3 won't work since I am exporting most of them as iSCSI block devices via ctld. Do you think setting it to 2 would stop them from being scanned during zpool import? If it is "dev" as you say, and the import looks at all devs, I assume this would not help?

Thanks for assisting with this.

-- Martin
Comment 3 Alan Somers freebsd_committer freebsd_triage 2019-10-05 18:53:22 UTC
To learn more about vfs.zfs.vol.mode, read zfs(8) and search for "volmode".  Setting it to 2 ought to stop ZFS from scanning zvols during import, because ZFS can only use geom devices as disks, not arbitrary device nodes.  However, I don't know if ctld will be happy about that.
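
The same can be set per dataset via the volmode property, which overrides the vfs.zfs.vol.mode default; a sketch, using an example zvol name taken from the report above:

# see how a particular zvol is currently exposed
zfs get volmode hal.1/1/vdisks/925
# expose it as a plain device node rather than a GEOM provider
zfs set volmode=dev hal.1/1/vdisks/925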

HOWEVER, I think there may be a bug here.  When vfs.zfs.vol.recursive is 0, ZFS isn't supposed to taste zvols when importing a pool.  But the catch is that ZFS has to taste disks twice: once from userspace and once from kernelspace.  I think the userspace part is missing the check to exclude zvols.

To test, run this command.  It will measure how much time ZFS takes to taste devices from userspace, and how much time the kernel module spends importing devices.  If my hunch is correct, you're spending most of your time tasting from userspace.

sudo dtrace -n 'pid$target::zpool_search_import:entry {trace(timestamp);}' -n 'pid$target::zpool_search_import:return {trace(timestamp);}' -n 'pid$target::zfs_ioctl:entry {trace(timestamp)}' -n 'pid$target::zfs_ioctl:return {trace(timestamp)}' -c "zpool import hal.5"
Comment 4 Martin Birgmeier 2019-10-05 19:02:48 UTC
This is the result:

[0]# sudo dtrace -n 'pid$target::zpool_search_import:entry {trace(timestamp);}' -n 'pid$target::zpool_search_import:return {trace(timestamp);}' -n 'pid$target::zfs_ioctl:entry {trace(timestamp)}' -n 'pid$target::zfs_ioctl:return {trace(timestamp)}' -c "zpool import hal.5"                                         
dtrace: description 'pid$target::zpool_search_import:entry ' matched 1 probe
dtrace: description 'pid$target::zpool_search_import:return ' matched 1 probe
dtrace: description 'pid$target::zfs_ioctl:entry ' matched 1 probe
dtrace: description 'pid$target::zfs_ioctl:return ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  3  49088        zpool_search_import:entry    42976262797132
  3  49089       zpool_search_import:return    42976263261730
  0  49090                  zfs_ioctl:entry    43041357153847
  0  49091                 zfs_ioctl:return    43041357158293
dtrace: pid 58369 has exited

[0]# 

Thanks again.

-- Martin
Comment 5 Alan Somers freebsd_committer freebsd_triage 2019-10-05 19:06:59 UTC
That didn't take very long for either step.  What was the total runtime of "zpool import"?
Comment 6 Martin Birgmeier 2019-10-05 19:10:43 UTC
1:22 (82 seconds) according to zsh history.
Comment 7 Alan Somers freebsd_committer freebsd_triage 2019-10-05 19:34:26 UTC
Well, there goes that theory.  Could you please try again with this dtrace script?  The output will be large, so pipe it to a file.

sudo dtrace -n 'profile-97 /pid == $target/ {@[ustack()] = count(); }' -c "zpool import hal.5"
Comment 8 Martin Birgmeier 2019-10-05 20:06:38 UTC
Created attachment 208125 [details]
stderr output
Comment 9 Martin Birgmeier 2019-10-05 20:07:21 UTC
Created attachment 208126 [details]
stdout

[0]# dtrace -n 'profile-97 /pid == $target/ {@[ustack()] = count(); }' -c "zpool import hal.5" > /usr/tmp/bug-241083.1.stdout 2> /usr/tmp/bug-241083.1.stderr
[0]# 

Strangely enough, the command now took only 13 seconds to complete.
Comment 10 Martin Birgmeier 2019-10-05 20:19:12 UTC
Am doing a plain 'zpool export hal.5' right now; this returns nearly immediately, but I see all zvols being accessed for a long time afterwards (using 'gstat -a').

An iSCSI export is actively being used (but with a rather low I/O load) all the time.

-- Martin
Comment 11 Martin Birgmeier 2019-10-05 20:20:57 UTC
Doing 'zpool import hal.5' again, it returns quickly (only 8 seconds). The zvols again continue to be accessed for quite a while afterwards; they were still being accessed while I was writing the previous sentences.
Comment 12 Martin Birgmeier 2019-10-05 20:29:45 UTC
Just did 'zpool export hal.5' again, returned nearly immediately, did not access the zvols (according to 'gstat -a').

I don't understand what is going on here...
Comment 13 Martin Birgmeier 2019-10-05 20:33:22 UTC
Maybe because I/O on this server has dropped to such low levels that stuff is still in some buffer cache?
Comment 14 Alan Somers freebsd_committer freebsd_triage 2019-10-05 20:46:41 UTC
No, the process of tasting devices bypasses the buffer cache.  Are you running zfsd?  Because that may taste devices after the import is complete.
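
A quick way to check whether zfsd is running or enabled; a sketch:

# is the zfsd fault-management daemon running right now?
service zfsd onestatus
# is it enabled in rc.conf?
sysrc -n zfsd_enable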
Comment 15 Alan Somers freebsd_committer freebsd_triage 2019-10-05 22:47:10 UTC
Created attachment 208127 [details]
Speed up "zpool import" in the presence of many zvols

Please test this patch.  Apply it, then build and install libzfs.  There is no need to rebuild the kernel or the zpool command, nor any need to reboot.  Then time "zpool import nonexistent_pool_name" with vfs.zfs.vol.recursive set to both 0 and 1.
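
A rough sketch of that procedure, assuming a stock source tree in /usr/src with the patch already applied (paths may differ on your system):

# rebuild and install only libzfs
cd /usr/src/cddl/lib/libzfs
make obj && make && make install
# time the userspace probe with zvol tasting disabled and enabled
sysctl vfs.zfs.vol.recursive=0
time zpool import nonexistent_pool_name
sysctl vfs.zfs.vol.recursive=1
time zpool import nonexistent_pool_name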
Comment 16 Martin Birgmeier 2019-10-06 05:48:02 UTC
No, not using zfsd.

I'll try the patch.

Thanks for the support!

-- Martin
Comment 17 Martin Birgmeier 2019-10-06 07:18:17 UTC
Hmmm...

[0]# ll /lib/libzfs.so.3*
-r--r--r--  1 root  wheel  310944 Oct  6 09:10 /lib/libzfs.so.3
-r--r--r--  1 root  wheel  310944 Dec 30  2018 /lib/libzfs.so.3.ORIG
[0]# sysctl vfs.zfs.vol.recursive
vfs.zfs.vol.recursive: 0
[0]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.04s user 1.26s system 4% cpu 29.296 total
[1]# sysctl vfs.zfs.vol.recursive=1    
vfs.zfs.vol.recursive: 0 -> 1
[0]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 1.11s system 28% cpu 4.033 total
[1]# sysctl vfs.zfs.vol.recursive=0    
vfs.zfs.vol.recursive: 1 -> 0
[0]# 

Note that using 'gstat -a' I saw that after the first import there was again a long time when all zvols would be active (the system has been running since yesterday as a lightly loaded iSCSI server). I waited for this activity to abate and then started the second import.

Again I have the impression that some caching mechanism is at work.

-- Martin
Comment 18 Alan Somers freebsd_committer freebsd_triage 2019-10-06 14:22:50 UTC
I notice that your old and new libzfs.so.3 files are exactly the same size.  Are you sure that you applied the patch correctly?
Comment 19 Martin Birgmeier 2019-10-06 16:02:04 UTC
I also noticed the same thing, but they are indeed different.

[0]# ll /lib/libzfs.so.3*
-r--r--r--  1 root  wheel  310944 Oct  6 09:10 /lib/libzfs.so.3
-r--r--r--  1 root  wheel  310944 Dec 30  2018 /lib/libzfs.so.3.ORIG
[0]# diff =(strings /lib/libzfs.so.3.ORIG) =(strings /lib/libzfs.so.3)         
870a871
> vfs.zfs.vol.recursive
1230a1232
> ZFS::ZVOL
1586a1589
> sysctlbyname
1989d1991
< sysctlbyname
2125,2126d2126
< fff.
< ffff.
2128,2130d2127
< l$(I
< \$0H
< ffffff.
2131a2129,2132
> ffff.
> fffff.
> fffff.
> s0A9
2987a2989
> Lc5=
[1]# 

Btw, I had to adapt the use of bool in the patch, as it does not exist in 12.0 (it is boolean_t and B_TRUE there). The patch itself applied cleanly but initially could not be compiled.

-- Martin
Comment 20 Alan Somers freebsd_committer freebsd_triage 2019-10-07 17:49:23 UTC
I can reproduce the problem locally, and the patch does speed up import for me.  Are you experiencing cache effects perhaps?  Please try the experiment again, but repeat the import command 4 times with each vfs.zfs.vol.recursive setting.  That should let us rule out cache effects.
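
For example, a shell loop that repeats the timing four times for each setting; a sketch using the same nonexistent pool name as above:

# repeat the import four times per vfs.zfs.vol.recursive setting
for setting in 0 1; do
    sysctl vfs.zfs.vol.recursive=$setting
    for i in 1 2 3 4; do
        time zpool import nonexistent_pool
    done
done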
Comment 21 Martin Birgmeier 2019-10-07 18:42:39 UTC
Here are two sets of results:

[0]# ll /lib/libzfs.so.3*              
-r--r--r--  1 root  wheel  310944 Oct  6 09:10 /lib/libzfs.so.3
-r--r--r--  1 root  wheel  310944 Dec 30  2018 /lib/libzfs.so.3.ORIG
[0]# sysctl vfs.zfs.vol.recursive      
vfs.zfs.vol.recursive: 0
[0]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.05s user 1.00s system 9% cpu 11.424 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.06s user 0.79s system 77% cpu 1.106 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.79s system 83% cpu 0.972 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.78s system 108% cpu 0.741 total
[1]# sysctl vfs.zfs.vol.recursive=1
vfs.zfs.vol.recursive: 0 -> 1
[0]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.98s system 135% cpu 0.747 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.95s system 220% cpu 0.440 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.93s system 249% cpu 0.380 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.05s user 0.92s system 249% cpu 0.387 total
[1]# sysctl vfs.zfs.vol.recursive=0    
vfs.zfs.vol.recursive: 1 -> 0
[0]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.83s system 147% cpu 0.581 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.80s system 240% cpu 0.343 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.79s system 103% cpu 0.785 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.79s system 234% cpu 0.342 total
[1]# 

One more observation: If there is medium to high disk load, such imports take much longer. For example, doing a 'tar cf /dev/null /z/backup/rsync' (with that directory holding a lot of stuff) in parallel, I get the following (the pool delivers a read rate of between 50 and 200 MB/s, and the machine has 16 GB of main memory):

[0]# sysctl vfs.zfs.vol.recursive=0    
vfs.zfs.vol.recursive: 1 -> 0
[0]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.85s system 12% cpu 6.924 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.81s system 16% cpu 5.110 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.78s system 21% cpu 3.753 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.85s system 11% cpu 7.981 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.81s system 15% cpu 5.366 total
[1]# sysctl vfs.zfs.vol.recursive=1    
vfs.zfs.vol.recursive: 0 -> 1
[0]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.98s system 9% cpu 10.749 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.97s system 12% cpu 7.945 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.94s system 28% cpu 3.382 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.04s user 0.95s system 8% cpu 11.164 total
[1]# sysctl vfs.zfs.vol.recursive=0    
vfs.zfs.vol.recursive: 1 -> 0
[0]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.81s system 8% cpu 10.134 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.01s user 0.81s system 11% cpu 7.034 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.80s system 9% cpu 8.948 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.78s system 15% cpu 5.380 total
[1]# diff =(strings /lib/libzfs.so.3.ORIG) =(strings /lib/libzfs.so.3)
870a871
> vfs.zfs.vol.recursive
1230a1232
> ZFS::ZVOL
1586a1589
> sysctlbyname
1989d1991
< sysctlbyname
2125,2126d2126
< fff.
< ffff.
2128,2130d2127
< l$(I
< \$0H
< ffffff.
2131a2129,2132
> ffff.
> fffff.
> fffff.
> s0A9
2987a2989
> Lc5=
[1]# 

-- Martin
Comment 22 Martin Birgmeier 2019-10-07 18:48:42 UTC
One more thing: Today the system has not been up for long and also hasn't had much disk activity. Could it be that there is still enough free (main) memory for the ZFS buffer cache (or whatever it's called), so that all times are relatively low, whereas when such a cache is full the system first has to make space before it can read in the information needed for the import, and freeing that space is continuously undone by the ongoing high disk activity, which immediately fills it again?

-- Martin
Comment 23 Alan Somers freebsd_committer freebsd_triage 2019-10-07 18:59:31 UTC
Yes, high disk activity will certainly interfere with the measurements.  And more snapshots will make the picture clearer.  For my tests, I used one zvol with 3000 snapshots.
Comment 24 commit-hook freebsd_committer freebsd_triage 2020-01-28 23:07:39 UTC
A commit references this bug:

Author: asomers
Date: Tue Jan 28 23:07:32 UTC 2020
New revision: 357235
URL: https://svnweb.freebsd.org/changeset/base/357235

Log:
  Speed up "zpool import" in the presence of many zvols

  By default, zpools may not be backed by zvols (that can be changed with the
  "vfs.zfs.vol.recursive" sysctl). When that sysctl is set to 0, the kernel
  does not attempt to read zvols when looking for vdevs. But the zpool command
  still does. This change brings the zpool command into line with the kernel's
  behavior. It speeds "zpool import" when an already imported pool has many
  zvols, or a zvol with many snapshots.

  PR:		241083
  Reported by:	Martin Birgmeier <d8zNeCFG@aon.at>
  Reviewed by:	mav, Ryan Moeller <ryan@freqlabs.com>
  MFC after:	2 weeks
  Sponsored by:	Axcient
  Differential Revision:	https://reviews.freebsd.org/D22077

Changes:
  head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_import.c
Comment 25 commit-hook freebsd_committer freebsd_triage 2020-02-13 20:49:52 UTC
A commit references this bug:

Author: asomers
Date: Thu Feb 13 20:49:46 UTC 2020
New revision: 357881
URL: https://svnweb.freebsd.org/changeset/base/357881

Log:
  MFC r357235:

  Speed up "zpool import" in the presence of many zvols

  By default, zpools may not be backed by zvols (that can be changed with the
  "vfs.zfs.vol.recursive" sysctl). When that sysctl is set to 0, the kernel
  does not attempt to read zvols when looking for vdevs. But the zpool command
  still does. This change brings the zpool command into line with the kernel's
  behavior. It speeds "zpool import" when an already imported pool has many
  zvols, or a zvol with many snapshots.

  PR:		241083
  Reported by:	Martin Birgmeier <d8zNeCFG@aon.at>
  Reviewed by:	mav, Ryan Moeller <ryan@freqlabs.com>
  Sponsored by:	Axcient
  Differential Revision:	https://reviews.freebsd.org/D22077

Changes:
_U  stable/12/
  stable/12/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_import.c
Comment 26 Alan Somers freebsd_committer freebsd_triage 2021-01-20 22:18:56 UTC
Reopening because the issue has regressed in FreeBSD 13, due to the OpenZFS import.
Comment 27 Alan Somers freebsd_committer freebsd_triage 2021-01-22 23:44:07 UTC
OpenZFS PR: https://github.com/openzfs/zfs/pull/11502 .
Comment 28 Alan Somers freebsd_committer freebsd_triage 2021-03-07 22:57:17 UTC
The OpenZFS PR was accepted, and integrated into FreeBSD 13.0-CURRENT by https://cgit.freebsd.org/src/commit/?id=69d9cc88d213f3d95526636743f15371f2fa6c2c .  Still not in stable/13, however.
Comment 29 Alan Somers freebsd_committer freebsd_triage 2021-06-17 21:05:03 UTC
Merged to stable/13 by 04c4bd7f7b525b1627304894e3413693e945bc0d.  Or maybe by 21b12f3e451b0d21ff4a3776cb5d11750c7cc822; the git history is confusing.  But it's definitely in there.