I have a system that boots from ZFS. I recently upgraded it from FreeBSD 11.2-STABLE to 12.0-STABLE:

FreeBSD electron 12.0-STABLE FreeBSD 12.0-STABLE r346293 GENERIC amd64

I upgraded source and rebooted into the new 12-STABLE kernel with no problems. After installworld and mergemaster, I discovered that certain ZFS filesystems were not auto-mounting in the correct sequence, "correct" meaning that /a mounts before /a/b, which mounts before /a/b/c.

Here are the mountable filesystems on that machine (canmount != off):

electron : 15:36:37 /root# zfs get -Hrt filesystem canmount | grep -vw off | while read FS b; do echo $FS; done | xargs zfs list -o name,canmount,mounted,mountpoint
NAME                           CANMOUNT  MOUNTED  MOUNTPOINT
electron                       on        no       none
electron/data                  on        no       /data
electron/data/backup           on        no       /data/backup
electron/data/backup/jimsdesk  on        no       /data/backup/jimsdesk
electron/system                on        no       none
electron/system/ROOT           on        no       none
electron/system/ROOT/default   on        yes      /
electron/system/home           on        yes      /home
electron/system/tmp            on        yes      /tmp
electron/system/usr            on        yes      /usr
electron/system/usr/local      on        yes      /usr/local
electron/system/var            on        yes      /var

The easiest tree to demonstrate with is electron/data and children, although this problem also occurs with electron/system/usr and children. I have not found any way to query the ZFS stack as to what it thinks the correct mount order should be, and thus demonstrate whether the automatic mounting order is correct.
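One way to approximate the expected order outside of zfs — an illustrative sketch, not a zfs query — is to sort the mountpoint paths lexically. A parent path is a strict prefix of each of its children, so in a plain lexical sort every parent comes out ahead of its children:

```shell
# Illustrative only (not a zfs feature): deriving a parent-before-child
# mount order by lexically sorting the mountpoint paths.  A parent path
# is a strict prefix of its children, so it always sorts first.
printf '%s\n' /data/backup/jimsdesk /data/backup /data | sort
# → /data
#   /data/backup
#   /data/backup/jimsdesk
```

In practice the same paths could be fed from `zfs list -H -o mountpoint`, but the point here is only that a total order exists and is trivially computable.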
However, I have fashioned this test script to highlight the observable differences between a correct, manual mounting and the incorrect mount order used by 'zfs mount -a':

--snip--
#!/usr/bin/env bash

# correct mount ordering:
MANUAL="
zfs mount electron/data
zfs mount electron/data/backup
zfs mount electron/data/backup/jimsdesk
"

# automatic mount ordering:
AUTO="
zfs mount -va
"

case $1 in
    manual) MOUNT="$MANUAL" ;;
    auto)   MOUNT="$AUTO" ;;
    *)      exit ;;
esac

cat << EOF | sh -v
zfs mount | grep '/data'
mkdir /data
find /data
$MOUNT
zfs mount | grep '/data'
find /data | head -3000000 | wc -l
zfs umount /data/backup/jimsdesk
zfs umount /data/backup
zfs umount /data
rmdir /data
find /data
EOF
--snip--

Here's a run of that script using the manual mounting order. The mountpoint /data is created fresh at the start. When the filesystems are correctly mounted, there are over 3 million files under /data. After unmounting, /data is empty and can be rmdir'ed.

--snip--
electron : 12:26:00 /root# ./zfs-mount-test.sh manual
zfs mount | grep '/data'
mkdir /data
find /data
/data
zfs mount electron/data
zfs mount electron/data/backup
zfs mount electron/data/backup/jimsdesk
zfs mount | grep '/data'
electron/data                   /data
electron/data/backup            /data/backup
electron/data/backup/jimsdesk   /data/backup/jimsdesk
find /data | head -3000000 | wc -l
 3000000
zfs umount /data/backup/jimsdesk
zfs umount /data/backup
zfs umount /data
rmdir /data
find /data
find: /data: No such file or directory
--snip--

Here's a run of that script using the automatic mounting order. When the filesystems are automatically mounted, fewer than 1.5 million files are present under /data. Unmounting everything generates an error message, /data is not clean, and rmdir returns an error.
--snip--
electron : 12:26:17 /root# ./zfs-mount-test.sh auto
zfs mount | grep '/data'
mkdir /data
find /data
/data
zfs mount -va
zfs mount | grep '/data'
electron/data                   /data
electron/data/backup            /data/backup
electron/data/backup/jimsdesk   /data/backup/jimsdesk
find /data | head -3000000 | wc -l
 1481445
zfs umount /data/backup/jimsdesk
zfs umount /data/backup
cannot unmount '/data/backup': not a mountpoint
zfs umount /data
rmdir /data
rmdir: /data: Directory not empty
find /data
/data
/data/backup
--snip--

Please let me know how I can provide additional information or testing. Thank you!

Jim
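The "cannot unmount '/data/backup': not a mountpoint" error says the directory exists but no filesystem is actually mounted on it — consistent with the child having been mounted before (and therefore hidden by) its parent. A hedged way to check this condition, using a hypothetical helper rather than any zfs command: a directory is a mountpoint exactly when it lives on a different device than its parent.

```shell
# Hypothetical helper (not part of zfs): a directory is a mountpoint
# when its device ID differs from its parent's.  GNU 'stat -c' is
# assumed here; on FreeBSD the equivalent spelling is 'stat -f %d'.
is_mountpoint() {
    [ "$(stat -c %d "$1")" != "$(stat -c %d "$1"/..)" ]
}

d=$(mktemp -d)   # a fresh directory on the same filesystem as its parent
is_mountpoint "$d" && echo "mountpoint" || echo "not a mountpoint"
# → not a mountpoint
rmdir "$d"
```

Running this against /data/backup after 'zfs mount -a' would distinguish "mounted here" from "merely a directory that something else is shadowing".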
See also https://lists.freebsd.org/pipermail/freebsd-stable/2019-April/090898.html
See also https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237517
Adding Rod Grimes to CC list at his request.
Created attachment 204229 [details]
truss output of 'zfs mount -a' showing incorrect mounting behaviour

Another view of the behaviour. System running on ZFS:

FreeBSD electron 12.0-STABLE FreeBSD 12.0-STABLE r346293 GENERIC amd64

# zfs mount
electron/system/ROOT/default    /
electron/system/var             /var
electron/system/home            /home
electron/system/usr             /usr
electron/system/tmp             /tmp
electron/system/usr/local       /usr/local

Of the entire set of mountable filesystems (canmount != off and mountpoint != none), there are three filesystems not yet mounted. Those three mountpoints are /data, /data/backup, and /data/backup/jimsdesk:

# zfs list -o canmount,mounted,mountpoint,name | egrep -vw '^[[:blank:]]*off'
CANMOUNT  MOUNTED  MOUNTPOINT             NAME
on        no       none                   electron
on        no       /data                  electron/data
on        no       /data/backup           electron/data/backup
on        no       /data/backup/jimsdesk  electron/data/backup/jimsdesk
on        no       none                   electron/system
on        no       none                   electron/system/ROOT
on        yes      /                      electron/system/ROOT/default
on        yes      /home                  electron/system/home
on        yes      /tmp                   electron/system/tmp
on        yes      /usr                   electron/system/usr
on        yes      /usr/local             electron/system/usr/local
on        yes      /var                   electron/system/var

Mountpoint /data does not exist in /:

# find /data -ls | wc -l
find: /data: No such file or directory
       0

After manually mounting electron/data, /data is created and filesystem content appears, including an empty mountpoint directory for the next filesystem, electron/data/backup:

# zfs mount electron/data
# find /data | wc -l
       8
# find /data/backup -ls
        8        1 drwxr-xr-x    2 root  wheel  2 May  3 12:55 /data/backup

After mounting electron/data/backup, filesystem content is present under the mountpoint /data/backup, including an empty mountpoint directory for the next filesystem, electron/data/backup/jimsdesk:

# find /data/backup | head -3000000 | wc -l
 3000000
# find /data/backup/jimsdesk -ls
  7904435        1 drwxr-xr-x    2 root  wheel  2 Mar  5 16:48 /data/backup/jimsdesk

After mounting electron/data/backup/jimsdesk, that filesystem's content also appears under
/data/backup/jimsdesk:

# zfs mount electron/data/backup/jimsdesk
# find /data/backup/jimsdesk | head -3000000 | wc -l
 1481437

That's all excellent. Manual mounting works. Let's clean up:

# zfs umount /data/backup/jimsdesk
# find /data/backup/jimsdesk -ls
  7904435        1 drwxr-xr-x    2 root  wheel  2 Mar  5 16:48 /data/backup/jimsdesk
# zfs umount /data/backup
# find /data/backup -ls
        8        1 drwxr-xr-x    2 root  wheel  2 May  3 12:55 /data/backup
# zfs umount /data
# find /data -ls
     1488        1 drwxr-xr-x    2 root  wheel  2 May  3 13:21 /data
# rmdir /data
# find /data -ls
find: /data: No such file or directory

Now we're back to our original mount state: everything mounted except /data, /data/backup, and /data/backup/jimsdesk:

# zfs list -o canmount,mounted,mountpoint,name | egrep -vw '^[[:blank:]]*off'
CANMOUNT  MOUNTED  MOUNTPOINT             NAME
on        no       none                   electron
on        no       /data                  electron/data
on        no       /data/backup           electron/data/backup
on        no       /data/backup/jimsdesk  electron/data/backup/jimsdesk
on        no       none                   electron/system
on        no       none                   electron/system/ROOT
on        yes      /                      electron/system/ROOT/default
on        yes      /home                  electron/system/home
on        yes      /tmp                   electron/system/tmp
on        yes      /usr                   electron/system/usr
on        yes      /usr/local             electron/system/usr/local
on        yes      /var                   electron/system/var

Now let's 'truss zfs mount -a'. Instead of inlining all the output, it's attached.
# truss zfs mount -a > truss-zfs-mount-a.out 2>&1
# egrep 'mount|mkdir' truss-zfs-mount-a.out
mkdir("/data/backup",0755)           ERR#2 'No such file or directory'
mkdir("/data",0755)                  = 0 (0x0)
mkdir("/data/backup",0755)           = 0 (0x0)
nmount(0x801653000,8,0x0)            = 0 (0x0)
mkdir("/data",0755)                  ERR#17 'File exists'
cannot mount '/data': failed to create mountpoint
write(2,"cannot mount '/data': failed to "...,50) = 50 (0x32)
nmount(0x801672000,8,0x0)            = 0 (0x0)

After 'zfs mount -a', only two of the expected three filesystems are mounted:

electron : 13:35:05 /root# zfs list -o canmount,mounted,mountpoint,name | egrep -vw '^[[:blank:]]*off'
CANMOUNT  MOUNTED  MOUNTPOINT             NAME
on        no       none                   electron
on        no       /data                  electron/data
on        yes      /data/backup           electron/data/backup
on        yes      /data/backup/jimsdesk  electron/data/backup/jimsdesk
on        no       none                   electron/system
on        no       none                   electron/system/ROOT
on        yes      /                      electron/system/ROOT/default
on        yes      /home                  electron/system/home
on        yes      /tmp                   electron/system/tmp
on        yes      /usr                   electron/system/usr
on        yes      /usr/local             electron/system/usr/local
on        yes      /var                   electron/system/var

electron/data is still not mounted, as per the 'cannot mount ...' error from 'zfs mount -a'. The truss output also shows that zfs was confused about the order in which to create mountpoints, and attempted to create the second-tier mountpoint (/data/backup) before creating the first-tier mountpoint (/data). The truss output file is attached to this PR.

So, although the mode of failure varies a little, I hope that this truss file will be sufficient for someone to look at and diagnose that 'zfs mount -a' is not doing things correctly. Let me know if I can provide additional info. Thank you!

Jim
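The first truss line is the whole story in miniature: mkdir of the child mountpoint was attempted before mkdir of its parent and failed with ENOENT. The same ordering failure is trivially reproducible with plain mkdir in a scratch directory (this is only a sketch of the syscall ordering, not of zfs itself):

```shell
# Reproducing the mkdir ordering from the truss output in a scratch
# directory: creating a child before its parent fails with ENOENT,
# just like mkdir("/data/backup",0755) ERR#2 above; parent-first works.
d=$(mktemp -d)
mkdir "$d/data/backup" 2>/dev/null \
    || echo "child first: No such file or directory"
mkdir "$d/data"          # parent first succeeds,
mkdir "$d/data/backup"   # then the child succeeds too
rm -rf "$d"
```

This is why the dispatch order matters: each mountpoint directory can only be created once every ancestor directory already exists.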
This bug also affects 11.3-STABLE, see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239243
The earlier-referenced bug 237517 now contains a fix from ZoL.
A commit references this bug:

Author: bapt
Date: Fri Jul 26 13:12:33 UTC 2019
New revision: 350358
URL: https://svnweb.freebsd.org/changeset/base/350358

Log:
  Fix a bug introduced with parallel mounting of zfs

  Incorporate a fix from zol:
  https://github.com/zfsonlinux/zfs/commit/ab5036df1ccbe1b18c1ce6160b5829e8039d94ce

  commit log from upstream:

    Fix race in parallel mount's thread dispatching algorithm

    Strategy of parallel mount is as follows.

    1) Initial thread dispatching is to select sets of mount points that
       don't have dependencies on other sets, hence threads can/should run
       lock-less and shouldn't race with other threads for other sets. Each
       thread dispatched corresponds to top level directory which may or may
       not have datasets to be mounted on sub directories.

    2) Subsequent recursive thread dispatching for each thread from 1) is to
       mount datasets for each set of mount points. The mount points within
       each set have dependencies (i.e. child directories), so child
       directories are processed only after parent directory completes.

    The problem is that the initial thread dispatching in
    zfs_foreach_mountpoint() can be multi-threaded when it needs to be
    single-threaded, and this puts threads under race condition. This race
    appeared as mount/unmount issues on ZoL for ZoL having different timing
    regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
    `zfs unmount -a` which expects proper mount order can't unmount if the
    mounts were reordered by the race condition.

    There are currently two known patterns of input list `handles` in
    `zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

    1) #8833 case where input is `/a /a /a/b` after sorting. The problem is
       that libzfs_path_contains() can't correctly handle an input list with
       two same top level directories. There is a race between two POSIX
       threads A and B,
        * ThreadA for "/a" for test1 and "/a/b"
        * ThreadB for "/a" for test0/a
       and in case of #8833, ThreadA won the race.
       Two threads were created because "/a" wasn't considered as
       `"/a" contains "/a"`.

    2) #8450 case where input is `/ /var/data /var/data/test` after sorting.
       The problem is that libzfs_path_contains() can't correctly handle an
       input list containing "/". There is a race between two POSIX threads
       A and B,
        * ThreadA for "/" and "/var/data/test"
        * ThreadB for "/var/data"
       and in case of #8450, ThreadA won the race.
       Two threads were created because "/var/data" wasn't considered as
       `"/" contains "/var/data"`.

    In other words, if there is (at least one) "/" in the input list, the
    initial thread dispatching must be single-threaded since every directory
    is a child of "/", meaning they all directly or indirectly depend on "/".

    In both cases, the first non_descendant_idx() call fails to correctly
    determine "path1-contains-path2", and as a result the initial thread
    dispatching creates another thread when it needs to be single-threaded.
    Fix a conditional in libzfs_path_contains() to consider above two.

  Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
  Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>

  PR:           237517, 237397, 239243
  Submitted by: Matthew D. Fuller <fullermd@over-yonder.net> (by email)
  MFC after:    3 days

Changes:
  head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c
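The repaired containment test described above can be sketched in shell. This is purely illustrative — the real fix is a conditional in the C function libzfs_path_contains() in libzfs_mount.c — but it captures the rule the commit log describes: path1 contains path2 when path1 is "/", when the two paths are equal, or when path2 begins with path1 followed by a "/".

```shell
# Illustrative shell sketch (not the actual C code) of the repaired
# libzfs_path_contains() logic: path1 "contains" path2 when path1 is
# "/", the paths are equal, or path2 starts with path1 plus a "/".
path_contains() {
    [ "$1" = "/" ] && return 0          # "/" contains every path (the #8450 fix)
    case $2 in
        "$1" | "$1"/*) return 0 ;;      # equality (the #8833 fix) or prefix + "/"
    esac
    return 1
}

path_contains /  /var/data && echo yes   # "/" contains everything
path_contains /a /a        && echo yes   # "/a" contains "/a"
path_contains /a /ab       || echo no    # prefix without a "/" boundary
```

The third case is the reason for the explicit "/" boundary: "/a" is a string prefix of "/ab" but not an ancestor directory of it.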
A commit references this bug:

Author: bapt
Date: Mon Jul 29 08:14:36 UTC 2019
New revision: 350401
URL: https://svnweb.freebsd.org/changeset/base/350401

Log:
  MFC r350358:

  Fix a bug introduced with parallel mounting of zfs

  Incorporate a fix from zol:
  https://github.com/zfsonlinux/zfs/commit/ab5036df1ccbe1b18c1ce6160b5829e8039d94ce

  [upstream commit log as quoted in r350358 above]

  Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
  Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>

  PR:           237517, 237397, 239243
  Submitted by: Matthew D. Fuller <fullermd@over-yonder.net> (by email)

Changes:
  _U  stable/12/
  stable/12/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c
A commit references this bug:

Author: bapt
Date: Mon Jul 29 08:23:15 UTC 2019
New revision: 350402
URL: https://svnweb.freebsd.org/changeset/base/350402

Log:
  MFC r350358:

  Fix a bug introduced with parallel mounting of zfs

  Incorporate a fix from zol:
  https://github.com/zfsonlinux/zfs/commit/ab5036df1ccbe1b18c1ce6160b5829e8039d94ce

  [upstream commit log as quoted in r350358 above]

  Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
  Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>

  PR:           237517, 237397, 239243
  Submitted by: Matthew D. Fuller <fullermd@over-yonder.net> (by email)

Changes:
  _U  stable/11/
  stable/11/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c
Based on my testing, it appears that this bug has been fixed. If no one has objections, this PR can be closed.