Scenario:
- Homegrown server, 6 disks
- Each disk is partitioned: ca. 3/4 for the first zpool ("hal.1"), ca. 1/5 for the second zpool, the rest for misc. stuff
- The first zpool is imported at boot (from the unencrypted "3/4" partitions); it contains about 300 zfs filesystems + 20 volumes, where the latter are frequently partitioned (MBR, BSD, MBR + BSD), because they are used as iSCSI disks for virtualized installations of various operating systems
- The second zpool is manually imported (when needed) from the encrypted partitions (each of the "1/5" partitions is encrypted); it contains about 20 filesystems (incl. volumes)
- Backups are kept by regularly taking recursive snapshots on both pools; about 10 backups are kept at any time
- This results in over 500 device entries in /dev/zvol (e.g., "/dev/zvol/hal.1/1/vdisks/925@backup.2019-09-21.21:59:27s2a")
- At some point, the second zpool is imported

Result:
- The import takes a long time
- Using "gstat -a", it can be observed that there is much activity on each of the devices in /dev/zvol/... during the time it takes to import the second pool

Expected result:
- Importing a pool should not look at snapshot devices but only at "normal" devices, where "normal" could be defined either as not containing a "@", or as being read/write, or by trying the devices with the shortest path first (or maybe there is a better discriminator), in order to speed up the import
- Maybe (by default) it should not look at devices in /dev/zvol at all, because normally one would not want to create a pool from devices in another pool
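The first discriminator proposed above (skip any device whose name contains a "@", i.e. a snapshot) can be sketched as a trivial filter over the candidate device list. The paths below are illustrative examples in the style of the entries described above, not real output:

```shell
# Hypothetical pre-filter for the taste step: drop snapshot zvol devices
# (names containing '@') so that only "normal" devices are probed.
printf '%s\n' \
    '/dev/ada0p2' \
    '/dev/zvol/hal.1/1/vdisks/925' \
    '/dev/zvol/hal.1/1/vdisks/925@backup.2019-09-21.21:59:27s2a' \
| grep -v '@'
```

Only the first two paths survive the filter; the snapshot device is never tasted.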
Here are a few suggestions:

1) Did you set the cachefile property on the second pool to "none"? If so, that's the main reason why import is taking so long, because ZFS needs to probe every geom provider. If you set the cachefile property to the default, then your pool will be imported at boot time, but the import will happen quickly.

2) What is the setting of vfs.zfs.vol.recursive? If it's non-zero, that might be contributing to your problem.

3) Are you exposing the zvols as GEOM providers or simple block devices? I think setting vfs.zfs.vol.mode to 2 or 3 should help you (1=geom, 2=dev, 3=none). Of course, ctld won't work if you set it to 3.
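For reference, the three settings in question can be inspected like this (read-only commands; the pool name "hal.5" is taken from this report):

```shell
# Check the settings discussed above.
zpool get cachefile hal.5         # "none" disables the import cache for the pool
sysctl vfs.zfs.vol.recursive      # non-zero allows zpools to be backed by zvols
sysctl vfs.zfs.vol.mode           # 1=geom, 2=dev, 3=none
```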
The settings are ("hal.5" is the second pool):

[0]# zpool get cachefile hal.5
NAME   PROPERTY   VALUE   SOURCE
hal.5  cachefile  -       default
[0]#

But I guess since I usually have it exported when unused, any traces of it are cleared from the cache, leading to a re-scan of all devices on the next import?

[0]# sysctl vfs.zfs.vol.recursive
vfs.zfs.vol.recursive: 0
[0]#
[0]# sysctl vfs.zfs.vol.mode
vfs.zfs.vol.mode: 1
[0]#

I did not know about this setting at all. Where can I read about it?

3 won't work since I am exporting most of them as iSCSI block devices via ctld. Do you think setting it to 2 would stop them from being scanned during zpool import? If it is "dev" as you say, and the import looks at all devs, this would not help, I assume?

Thanks for assisting with this.

-- Martin
To learn more about vfs.zfs.vol.mode, read zfs(8) and search for "volmode". Setting it to 2 ought to stop ZFS from scanning zvols during import, because ZFS can only use geom devices as disks, not arbitrary device nodes. However, I don't know if ctld will be happy about that.

HOWEVER, I think there may be a bug here. When vfs.zfs.vol.recursive is 0, ZFS isn't supposed to taste zvols when importing a pool. But the catch is that ZFS has to taste disks twice: once from userspace and once from kernelspace. I think the userspace part is missing the check to exclude zvols.

To test, run this command. It will measure how much time ZFS takes to taste devices from userspace, and how much time the kernel module spends importing devices. If my hunch is correct, you're spending most of your time tasting from userspace.

sudo dtrace -n 'pid$target::zpool_search_import:entry {trace(timestamp);}' -n 'pid$target::zpool_search_import:return {trace(timestamp);}' -n 'pid$target::zfs_ioctl:entry {trace(timestamp)}' -n 'pid$target::zfs_ioctl:return {trace(timestamp)}' -c "zpool import hal.5"
This is the result:

[0]# sudo dtrace -n 'pid$target::zpool_search_import:entry {trace(timestamp);}' -n 'pid$target::zpool_search_import:return {trace(timestamp);}' -n 'pid$target::zfs_ioctl:entry {trace(timestamp)}' -n 'pid$target::zfs_ioctl:return {trace(timestamp)}' -c "zpool import hal.5"
dtrace: description 'pid$target::zpool_search_import:entry ' matched 1 probe
dtrace: description 'pid$target::zpool_search_import:return ' matched 1 probe
dtrace: description 'pid$target::zfs_ioctl:entry ' matched 1 probe
dtrace: description 'pid$target::zfs_ioctl:return ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  3  49088    zpool_search_import:entry    42976262797132
  3  49089   zpool_search_import:return    42976263261730
  0  49090              zfs_ioctl:entry    43041357153847
  0  49091             zfs_ioctl:return    43041357158293
dtrace: pid 58369 has exited
[0]#

Thanks again.

-- Martin
That didn't take very long for either step. What was the total runtime of "zpool import"?
1:22 (82 seconds) according to zsh history.
Well, there goes that theory. Could you please try again with this dtrace script? The output will be large, so pipe it to a file.

sudo dtrace -n 'profile-97 /pid == $target/ {@[ustack()] = count(); }' -c "zpool import hal.5"
Created attachment 208125 [details]
stderr output
Created attachment 208126 [details]
stdout

[0]# dtrace -n 'profile-97 /pid == $target/ {@[ustack()] = count(); }' -c "zpool import hal.5" > /usr/tmp/bug-241083.1.stdout 2> /usr/tmp/bug-241083.1.stderr
[0]#

Strangely enough, the command now took only 13 seconds to complete.
Am doing a plain 'zpool export hal.5' right now; it returns nearly immediately, but I can see all zvols being accessed for a long time afterwards (using 'gstat -a'). An iSCSI export is actively being used the whole time (but with a rather low I/O load). -- Martin
Doing 'zpool import hal.5' again: it returns quickly (only 8 seconds), but the zvols again continue to be accessed for quite a while afterwards. They were still at it while I was writing the previous sentences.
Just did 'zpool export hal.5' again, returned nearly immediately, did not access the zvols (according to 'gstat -a'). I don't understand what is going on here...
Maybe because I/O on this server has dropped to such low levels that stuff is still in some buffer cache?
No, the process of tasting devices bypasses the buffer cache. Are you running zfsd? Because that may taste devices after the import is complete.
Created attachment 208127 [details]
Speed up "zpool import" in the presence of many zvols

Please test this patch. Apply the patch, then build and install libzfs. There is no need to build the kernel or the zpool command, nor any need to reboot. Then time "zpool import nonexistent_pool_name" with vfs.zfs.vol.recursive set to both 0 and 1.
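For anyone following along, the rebuild would look roughly like this (a sketch; it assumes a stock /usr/src source tree on FreeBSD 12, where libzfs lives under cddl/lib — adjust paths to your setup):

```shell
# Rebuild and install only libzfs after applying the patch; no kernel
# rebuild or reboot is required since the change is in userspace.
cd /usr/src/cddl/lib/libzfs
make obj && make && make install
```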
No, not using zfsd. I'll try the patch. Thanks for the support! -- Martin
Hmmm...

[0]# ll /lib/libzfs.so.3*
-r--r--r--  1 root  wheel  310944 Oct  6 09:10 /lib/libzfs.so.3
-r--r--r--  1 root  wheel  310944 Dec 30  2018 /lib/libzfs.so.3.ORIG
[0]# sysctl vfs.zfs.vol.recursive
vfs.zfs.vol.recursive: 0
[0]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.04s user 1.26s system 4% cpu 29.296 total
[1]# sysctl vfs.zfs.vol.recursive=1
vfs.zfs.vol.recursive: 0 -> 1
[0]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 1.11s system 28% cpu 4.033 total
[1]# sysctl vfs.zfs.vol.recursive=0
vfs.zfs.vol.recursive: 1 -> 0
[0]#

Note that using 'gstat -a' I saw that after the first import there was again a long period during which all zvols were active (the system has been running since yesterday as a lightly loaded iSCSI server). I waited for this activity to abate and then started the second import. Again I have the impression that some caching mechanism is at work.

-- Martin
I notice that your old and new libzfs.so.3 files are exactly the same size. Are you sure that you applied the patch correctly?
I also noticed the same thing, but they are indeed different.

[0]# ll /lib/libzfs.so.3*
-r--r--r--  1 root  wheel  310944 Oct  6 09:10 /lib/libzfs.so.3
-r--r--r--  1 root  wheel  310944 Dec 30  2018 /lib/libzfs.so.3.ORIG
[0]# diff =(strings /lib/libzfs.so.3.ORIG) =(strings /lib/libzfs.so.3)
870a871
> vfs.zfs.vol.recursive
1230a1232
> ZFS::ZVOL
1586a1589
> sysctlbyname
1989d1991
< sysctlbyname
2125,2126d2126
< fff.
< ffff.
2128,2130d2127
< l$(I
< \$0H
< ffffff.
2131a2129,2132
> ffff.
> fffff.
> fffff.
> s0A9
2987a2989
> Lc5=
[1]#

Btw, I had to adapt bool in the patch, as it does not exist in 12.0 (it is boolean_t and B_TRUE). (The patch initially applied cleanly but could not be compiled.)

-- Martin
I can reproduce the problem locally, and the patch does speed up import for me. Are you experiencing cache effects perhaps? Please try the experiment again, but repeat the import command 4 times with each vfs.zfs.vol.recursive setting. That should let us rule out cache effects.
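Something like the following would produce the four timings per setting (a sketch; it must be run as root on the affected host, and "nonexistent_pool" forces a full device scan without actually importing anything):

```shell
# Benchmark sketch: repeat the failing import four times per
# vfs.zfs.vol.recursive setting so that cache effects average out.
for rec in 0 1; do
    sysctl vfs.zfs.vol.recursive=$rec
    for n in 1 2 3 4; do
        time zpool import nonexistent_pool
    done
done
```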
Here are two sets of results:

[0]# ll /lib/libzfs.so.3*
-r--r--r--  1 root  wheel  310944 Oct  6 09:10 /lib/libzfs.so.3
-r--r--r--  1 root  wheel  310944 Dec 30  2018 /lib/libzfs.so.3.ORIG
[0]# sysctl vfs.zfs.vol.recursive
vfs.zfs.vol.recursive: 0
[0]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.05s user 1.00s system 9% cpu 11.424 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.06s user 0.79s system 77% cpu 1.106 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.79s system 83% cpu 0.972 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.78s system 108% cpu 0.741 total
[1]# sysctl vfs.zfs.vol.recursive=1
vfs.zfs.vol.recursive: 0 -> 1
[0]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.98s system 135% cpu 0.747 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.95s system 220% cpu 0.440 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.93s system 249% cpu 0.380 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.05s user 0.92s system 249% cpu 0.387 total
[1]# sysctl vfs.zfs.vol.recursive=0
vfs.zfs.vol.recursive: 1 -> 0
[0]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.83s system 147% cpu 0.581 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.80s system 240% cpu 0.343 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.79s system 103% cpu 0.785 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.79s system 234% cpu 0.342 total
[1]#

One more observation: If there is medium to high disk load, such imports take much longer. For example, doing a 'tar cf /dev/null /z/backup/rsync' (with that directory holding a lot of stuff) in parallel, I get (the pool will give a read rate of between 50..200 MB/s, and the machine has 16 GB main memory):

[0]# sysctl vfs.zfs.vol.recursive=0
vfs.zfs.vol.recursive: 1 -> 0
[0]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.85s system 12% cpu 6.924 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.81s system 16% cpu 5.110 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.78s system 21% cpu 3.753 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.85s system 11% cpu 7.981 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.02s user 0.81s system 15% cpu 5.366 total
[1]# sysctl vfs.zfs.vol.recursive=1
vfs.zfs.vol.recursive: 0 -> 1
[0]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.98s system 9% cpu 10.749 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.97s system 12% cpu 7.945 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.94s system 28% cpu 3.382 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.04s user 0.95s system 8% cpu 11.164 total
[1]# sysctl vfs.zfs.vol.recursive=0
vfs.zfs.vol.recursive: 1 -> 0
[0]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.81s system 8% cpu 10.134 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.01s user 0.81s system 11% cpu 7.034 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.80s system 9% cpu 8.948 total
[1]# time zpool import nonexistent_pool
cannot import 'nonexistent_pool': no such pool available
zpool import nonexistent_pool  0.03s user 0.78s system 15% cpu 5.380 total
[1]# diff =(strings /lib/libzfs.so.3.ORIG) =(strings /lib/libzfs.so.3)
870a871
> vfs.zfs.vol.recursive
1230a1232
> ZFS::ZVOL
1586a1589
> sysctlbyname
1989d1991
< sysctlbyname
2125,2126d2126
< fff.
< ffff.
2128,2130d2127
< l$(I
< \$0H
< ffffff.
2131a2129,2132
> ffff.
> fffff.
> fffff.
> s0A9
2987a2989
> Lc5=
[1]#

-- Martin
One more thing: Today the system has not been up for long and also hasn't had much disk activity. Could it be that there is still enough free (main) memory for the ZFS buffer cache (or whatever it is called), so that all times are relatively low? Whereas when such a cache is full, the system first has to make space in order to be able to read in the information needed for the import, and freeing that space is continuously negated by ongoing high disk activity, which immediately fills any freed space in the buffer cache again? -- Martin
Yes, high disk activity will certainly interfere with the measurements. And more snapshots will make the picture clearer. For my tests, I used one zvol with 3000 snapshots.
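A setup like the one used for these tests could be created roughly as follows (a sketch only; it assumes a scratch pool named "tank" exists, and it creates 3000 snapshots, so do not run it on a production system):

```shell
# Reproduction sketch: one zvol with 3000 snapshots, mirroring the
# test configuration mentioned above.
zfs create -V 100M tank/testvol
i=1
while [ "$i" -le 3000 ]; do
    zfs snapshot "tank/testvol@snap$i"
    i=$((i + 1))
done
```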
A commit references this bug:

Author: asomers
Date: Tue Jan 28 23:07:32 UTC 2020
New revision: 357235
URL: https://svnweb.freebsd.org/changeset/base/357235

Log:
  Speed up "zpool import" in the presence of many zvols

  By default, zpools may not be backed by zvols (that can be changed with
  the "vfs.zfs.vol.recursive" sysctl). When that sysctl is set to 0, the
  kernel does not attempt to read zvols when looking for vdevs. But the
  zpool command still does. This change brings the zpool command into line
  with the kernel's behavior. It speeds "zpool import" when an already
  imported pool has many zvols, or a zvol with many snapshots.

  PR:             241083
  Reported by:    Martin Birgmeier <d8zNeCFG@aon.at>
  Reviewed by:    mav, Ryan Moeller <ryan@freqlabs.com>
  MFC after:      2 weeks
  Sponsored by:   Axcient
  Differential Revision:  https://reviews.freebsd.org/D22077

Changes:
  head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_import.c
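The behavior the commit describes can be modeled with a small sketch (the provider data below is made up for illustration; the real fix is C code in libzfs_import.c operating on geom providers, not a shell script): when vfs.zfs.vol.recursive is 0, providers whose GEOM class is ZFS::ZVOL are skipped before tasting.

```shell
# Model of the fix: skip zvol-backed providers when recursion is off.
# Input lines are hypothetical "class provider" pairs standing in for
# real geom(4) provider data.
recursive=0
printf '%s\n' \
    'DISK ada0' \
    'PART ada0p3' \
    'ZFS::ZVOL zvol/hal.1/1/vdisks/925' |
while read -r class prov; do
    if [ "$recursive" -eq 0 ] && [ "$class" = "ZFS::ZVOL" ]; then
        continue                  # do not taste zvol-backed providers
    fi
    echo "taste /dev/$prov"
done
```

With recursive=0 only ada0 and ada0p3 are tasted; setting recursive=1 would let the zvol through, matching the sysctl's documented effect.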
A commit references this bug:

Author: asomers
Date: Thu Feb 13 20:49:46 UTC 2020
New revision: 357881
URL: https://svnweb.freebsd.org/changeset/base/357881

Log:
  MFC r357235:

  Speed up "zpool import" in the presence of many zvols

  By default, zpools may not be backed by zvols (that can be changed with
  the "vfs.zfs.vol.recursive" sysctl). When that sysctl is set to 0, the
  kernel does not attempt to read zvols when looking for vdevs. But the
  zpool command still does. This change brings the zpool command into line
  with the kernel's behavior. It speeds "zpool import" when an already
  imported pool has many zvols, or a zvol with many snapshots.

  PR:             241083
  Reported by:    Martin Birgmeier <d8zNeCFG@aon.at>
  Reviewed by:    mav, Ryan Moeller <ryan@freqlabs.com>
  Sponsored by:   Axcient
  Differential Revision:  https://reviews.freebsd.org/D22077

Changes:
  _U  stable/12/
  stable/12/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_import.c
Reopening because the issue has regressed in FreeBSD 13, due to the OpenZFS import.
OpenZFS PR: https://github.com/openzfs/zfs/pull/11502
The OpenZFS PR was accepted, and integrated into FreeBSD 13.0-CURRENT by https://cgit.freebsd.org/src/commit/?id=69d9cc88d213f3d95526636743f15371f2fa6c2c . Still not in stable/13, however.
Merged to stable/13 by 04c4bd7f7b525b1627304894e3413693e945bc0d. Or maybe by 21b12f3e451b0d21ff4a3776cb5d11750c7cc822; the git history is confusing. But it's definitely in there.