Bug 237397

Summary: 'zfs mount -a' mounts filesystems in incorrect order
Product: Base System          Reporter: Jim Long <freebsd-bugzilla>
Component: kern               Assignee: freebsd-fs (Nobody) <fs>
Status: New
Severity: Affects Some People CC: bsdpr, fullermd, rgrimes, vsasjason
Priority: ---                 Keywords: regression
Version: 12.0-STABLE
Hardware: amd64
OS: Any
Attachments:
  truss output of 'zfs mount -a' showing incorrect mounting behaviour

Description Jim Long 2019-04-19 22:54:45 UTC
I have a system that boots from ZFS.  I recently upgraded it from
FreeBSD 11.2-STABLE to 12.0-STABLE.

FreeBSD electron 12.0-STABLE FreeBSD 12.0-STABLE r346293 GENERIC  amd64

I upgraded from source, and rebooted into the new 12-STABLE kernel with no
problems.  After installworld and mergemaster, I discovered that certain
ZFS filesystems were no longer auto-mounting in the correct sequence,
"correct" meaning that /a mounts before /a/b, which mounts before /a/b/c.

Here are the mountable filesystems on that machine (canmount != off):

electron : 15:36:37 /root# zfs get -Hrt filesystem canmount | grep -vw off | while read FS b; do echo $FS; done | xargs zfs list -o name,canmount,mounted,mountpoint
NAME                           CANMOUNT  MOUNTED  MOUNTPOINT
electron                             on       no  none
electron/data                        on       no  /data
electron/data/backup                 on       no  /data/backup
electron/data/backup/jimsdesk        on       no  /data/backup/jimsdesk
electron/system                      on       no  none
electron/system/ROOT                 on       no  none
electron/system/ROOT/default         on      yes  /
electron/system/home                 on      yes  /home
electron/system/tmp                  on      yes  /tmp
electron/system/usr                  on      yes  /usr
electron/system/usr/local            on      yes  /usr/local
electron/system/var                  on      yes  /var

The easiest tree to demonstrate with is electron/data and its children,
although the problem also occurs with electron/system/usr and its children.

I have not found any way to query the ZFS stack for the mount order it
intends to use, and thus cannot demonstrate directly whether the
automatic mounting order is correct.  However, I have fashioned this
test script to highlight the observable differences between a correct
manual mounting and the incorrect mount order used by 'zfs mount -a':


--snip--
#!/usr/bin/env bash

# correct mount ordering:
MANUAL="
        zfs mount electron/data
        zfs mount electron/data/backup
        zfs mount electron/data/backup/jimsdesk
"

# automatic mount ordering:
AUTO="
        zfs mount -va
"


case "$1" in
   manual)
        MOUNT="$MANUAL"
        ;;

   auto)
        MOUNT="$AUTO"
        ;;
   *)
        exit
        ;;
esac


cat << EOF | sh -v
        zfs mount | grep '/data'
        mkdir /data
        find  /data
$MOUNT
        zfs mount | grep '/data'
        find /data | head -3000000 | wc -l
        zfs umount /data/backup/jimsdesk
        zfs umount /data/backup
        zfs umount /data
        rmdir /data
        find  /data
EOF
--snip--
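
As an aside, the expected parent-before-child order can be approximated
without mounting anything: sorting the mountpoints lexically places each
parent path ahead of its children.  This is only a sketch of the expected
order, not a query of what libzfs will actually do:

--snip--
#!/bin/sh
# Approximate the parent-before-child mount order by sorting mountpoints;
# "/data" sorts ahead of "/data/backup" because it is a strict prefix.
# Skips datasets with canmount=off or mountpoint=none.
zfs list -H -t filesystem -o canmount,mountpoint,name |
        awk '$1 != "off" && $2 != "none" { print $2 "\t" $3 }' |
        sort
--snip--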

Here's a run of that script using the manual mounting order.  The
mountpoint /data is created fresh at the start.  When the filesystems
are correctly mounted, there are over 3 million files under /data.
After unmounting, /data is empty, and can be rmdir'ed.

--snip--
electron : 12:26:00 /root# ./zfs-mount-test.sh manual
        zfs mount | grep '/data'
        mkdir /data
        find  /data
/data

        zfs mount electron/data
        zfs mount electron/data/backup
        zfs mount electron/data/backup/jimsdesk

        zfs mount | grep '/data'
electron/data                   /data
electron/data/backup            /data/backup
electron/data/backup/jimsdesk   /data/backup/jimsdesk
        find /data | head -3000000 | wc -l
 3000000
        zfs umount /data/backup/jimsdesk
        zfs umount /data/backup
        zfs umount /data
        rmdir /data
        find  /data
find: /data: No such file or directory
--snip--

Here's a run of that script using the automatic mounting order.  When
the filesystems are mounted automatically, fewer than 1.5 million files
are present under /data.  Unmounting everything generates an error
message, /data is left non-empty, and rmdir fails.

--snip--
electron : 12:26:17 /root# ./zfs-mount-test.sh auto
        zfs mount | grep '/data'
        mkdir /data
        find  /data
/data

        zfs mount -va

        zfs mount | grep '/data'
electron/data                   /data
electron/data/backup            /data/backup
electron/data/backup/jimsdesk   /data/backup/jimsdesk
        find /data | head -3000000 | wc -l
 1481445
        zfs umount /data/backup/jimsdesk
        zfs umount /data/backup
cannot unmount '/data/backup': not a mountpoint
        zfs umount /data
        rmdir /data
rmdir: /data: Directory not empty
        find  /data
/data
/data/backup
--snip--
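
A plausible reading of this transcript: electron/data/backup was mounted
first, then electron/data was mounted over it, shadowing its ~3 million
files; electron/data/backup/jimsdesk then mounted into the empty backup
directory belonging to electron/data.  That would explain both the reduced
file count and the 'not a mountpoint' error.  One way to check which
dataset actually answers at each path (a minimal sketch, assuming standard
df output with the dataset name in the first column) is:

--snip--
#!/bin/sh
# For each path, print the dataset df reports as backing it.  A shadowed
# mount shows the covering dataset rather than the one 'zfs mount' lists.
for p in /data /data/backup /data/backup/jimsdesk; do
        printf '%-24s' "$p"
        df "$p" | awk 'NR == 2 { print $1 }'
done
--snip--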


Please let me know how I can provide additional information or testing.

Thank you!

Jim
Comment 2 Jim Long 2019-04-28 21:21:17 UTC
Adding Rod Grimes to CC list at his request.
Comment 3 Jim Long 2019-05-05 19:18:52 UTC
Created attachment 204229 [details]
truss output of 'zfs mount -a' showing incorrect mounting behaviour

Another view of the behaviour:

System running on ZFS:

FreeBSD electron 12.0-STABLE FreeBSD 12.0-STABLE r346293 GENERIC  amd64

# zfs mount
electron/system/ROOT/default    /
electron/system/var             /var
electron/system/home            /home
electron/system/usr             /usr
electron/system/tmp             /tmp
electron/system/usr/local       /usr/local

Of the entire set of mountable filesystems (canmount != off and mountpoint !=
none), there are three filesystems not yet mounted.  Those three mountpoints
are /data, /data/backup, and /data/backup/jimsdesk:

# zfs list -o canmount,mounted,mountpoint,name | egrep -vw '^[[:blank:]]*off'
CANMOUNT  MOUNTED  MOUNTPOINT               NAME
      on       no  none                     electron
      on       no  /data                    electron/data
      on       no  /data/backup             electron/data/backup
      on       no  /data/backup/jimsdesk    electron/data/backup/jimsdesk
      on       no  none                     electron/system
      on       no  none                     electron/system/ROOT
      on      yes  /                        electron/system/ROOT/default
      on      yes  /home                    electron/system/home
      on      yes  /tmp                     electron/system/tmp
      on      yes  /usr                     electron/system/usr
      on      yes  /usr/local               electron/system/usr/local
      on      yes  /var                     electron/system/var

Mountpoint /data does not exist in /:

# find /data -ls | wc -l
find: /data: No such file or directory
       0

After manually mounting electron/data, /data is created and its filesystem
content appears, including an empty mountpoint directory for the next
filesystem, electron/data/backup:

# zfs mount electron/data
# find /data | wc -l
       8
# find /data/backup -ls
     8        1 drwxr-xr-x    2 root         wheel     2 May  3 12:55 /data/backup

After mounting electron/data/backup, filesystem content is present under the
mountpoint /data/backup, including an empty mountpoint directory for the next
filesystem, electron/data/backup/jimsdesk:

# zfs mount electron/data/backup
# find /data/backup | head -3000000 | wc -l
 3000000
# find /data/backup/jimsdesk -ls
7904435        1 drwxr-xr-x    2 root         wheel     2 Mar  5 16:48 /data/backup/jimsdesk

After mounting electron/data/backup/jimsdesk, that filesystem content also
appears under /data/backup/jimsdesk:

# zfs mount electron/data/backup/jimsdesk
# find /data/backup/jimsdesk | head -3000000 | wc -l
 1481437

That's all excellent.  Manual mounting works.  Let's clean up:

# zfs umount /data/backup/jimsdesk
# find /data/backup/jimsdesk -ls
7904435        1 drwxr-xr-x    2 root         wheel     2 Mar  5 16:48 /data/backup/jimsdesk
# zfs umount /data/backup
# find /data/backup -ls
     8        1 drwxr-xr-x    2 root         wheel     2 May  3 12:55 /data/backup
# zfs umount /data 
# find /data -ls
  1488        1 drwxr-xr-x    2 root         wheel     2 May  3 13:21 /data
# rmdir /data
# find /data -ls
find: /data: No such file or directory

Now we're back to our original mount state: everything is mounted except
/data, /data/backup, and /data/backup/jimsdesk:

# zfs list -o canmount,mounted,mountpoint,name | egrep -vw '^[[:blank:]]*off'
CANMOUNT  MOUNTED  MOUNTPOINT               NAME
      on       no  none                     electron
      on       no  /data                    electron/data
      on       no  /data/backup             electron/data/backup
      on       no  /data/backup/jimsdesk    electron/data/backup/jimsdesk
      on       no  none                     electron/system
      on       no  none                     electron/system/ROOT
      on      yes  /                        electron/system/ROOT/default
      on      yes  /home                    electron/system/home
      on      yes  /tmp                     electron/system/tmp
      on      yes  /usr                     electron/system/usr
      on      yes  /usr/local               electron/system/usr/local
      on      yes  /var                     electron/system/var

Now let's run 'zfs mount -a' under truss.  Rather than inline all of the output, it is attached.

# truss zfs mount -a > truss-zfs-mount-a.out 2>&1
# egrep 'mount|mkdir' truss-zfs-mount-a.out 
mkdir("/data/backup",0755)                       ERR#2 'No such file or directory'
mkdir("/data",0755)                              = 0 (0x0)
mkdir("/data/backup",0755)                       = 0 (0x0)
nmount(0x801653000,8,0x0)                        = 0 (0x0)
mkdir("/data",0755)                              ERR#17 'File exists'
cannot mount '/data': failed to create mountpoint
write(2,"cannot mount '/data': failed to "...,50) = 50 (0x32)
nmount(0x801672000,8,0x0)                        = 0 (0x0)

After 'zfs mount -a', only two of the expected three filesystems are mounted:

electron : 13:35:05 /root# zfs list -o canmount,mounted,mountpoint,name | egrep -vw '^[[:blank:]]*off'
CANMOUNT  MOUNTED  MOUNTPOINT               NAME
      on       no  none                     electron
      on       no  /data                    electron/data
      on      yes  /data/backup             electron/data/backup
      on      yes  /data/backup/jimsdesk    electron/data/backup/jimsdesk
      on       no  none                     electron/system
      on       no  none                     electron/system/ROOT
      on      yes  /                        electron/system/ROOT/default
      on      yes  /home                    electron/system/home
      on      yes  /tmp                     electron/system/tmp
      on      yes  /usr                     electron/system/usr
      on      yes  /usr/local               electron/system/usr/local
      on      yes  /var                     electron/system/var

electron/data is still not mounted, as shown by the 'cannot mount ...' error
from 'zfs mount -a'.  The truss output also shows that zfs was confused about
the order in which to create mountpoints: it attempted to create the
second-tier mountpoint (/data/backup) before the first-tier mountpoint
(/data).
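
For reference, here is a small heuristic sketch that flags this kind of
ordering problem directly from 'zfs mount' output.  It assumes mountpoints
contain no whitespace, and it will also flag parents that are plain
directories rather than ZFS mountpoints, so treat its output as advisory:

--snip--
#!/bin/sh
# Flag any mountpoint that 'zfs mount' reports before its parent directory.
seen="/"
zfs mount | awk '{ print $2 }' | while read -r mp; do
        parent=$(dirname "$mp")
        case " $seen " in
            *" $parent "*) ;;   # parent appeared earlier: order looks right
            *) echo "suspect order: $mp (parent $parent not seen yet)" ;;
        esac
        seen="$seen $mp"
done
--snip--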

The truss output file is attached to this PR.

So, although the mode of failure varies a little from run to run, I hope that
this truss file will be sufficient for someone to diagnose why 'zfs mount -a'
is not doing things correctly.  Let me know if I can provide additional info.

Thank you!

Jim
Comment 4 Bertrand Petit 2019-07-20 09:58:25 UTC
This bug also affects 11.3-STABLE, see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239243
Comment 5 fullermd 2019-07-21 17:07:19 UTC
The earlier-referenced bug 237517 now contains a fix from ZoL.
Comment 6 commit-hook freebsd_committer freebsd_triage 2019-07-26 13:12:46 UTC
A commit references this bug:

Author: bapt
Date: Fri Jul 26 13:12:33 UTC 2019
New revision: 350358
URL: https://svnweb.freebsd.org/changeset/base/350358

Log:
  Fix a bug introduced with parallel mounting of zfs

  Incorporate a fix from zol:
  https://github.com/zfsonlinux/zfs/commit/ab5036df1ccbe1b18c1ce6160b5829e8039d94ce

  commit log from upstream:
   Fix race in parallel mount's thread dispatching algorithm

  Strategy of parallel mount is as follows.

  1) Initial thread dispatching is to select sets of mount points that
   don't have dependencies on other sets, hence threads can/should run
   lock-less and shouldn't race with other threads for other sets. Each
   thread dispatched corresponds to top level directory which may or may
   not have datasets to be mounted on sub directories.

  2) Subsequent recursive thread dispatching for each thread from 1)
   is to mount datasets for each set of mount points. The mount points
   within each set have dependencies (i.e. child directories), so child
   directories are processed only after parent directory completes.

  The problem is that the initial thread dispatching in
  zfs_foreach_mountpoint() can be multi-threaded when it needs to be
  single-threaded, and this puts threads under race condition. This race
  appeared as mount/unmount issues on ZoL for ZoL having different
  timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
  `zfs unmount -a` which expects proper mount order can't unmount if the
  mounts were reordered by the race condition.

  There are currently two known patterns of input list `handles` in
  `zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

  1) #8833 case where input is `/a /a /a/b` after sorting.
   The problem is that libzfs_path_contains() can't correctly handle an
   input list with two same top level directories.
   There is a race between two POSIX threads A and B,
    * ThreadA for "/a" for test1 and "/a/b"
    * ThreadB for "/a" for test0/a
   and in case of #8833, ThreadA won the race. Two threads were created
   because "/a" wasn't considered as `"/a" contains "/a"`.

  2) #8450 case where input is `/ /var/data /var/data/test` after sorting.
   The problem is that libzfs_path_contains() can't correctly handle an
   input list containing "/".
   There is a race between two POSIX threads A and B,
    * ThreadA for "/" and "/var/data/test"
    * ThreadB for "/var/data"
   and in case of #8450, ThreadA won the race. Two threads were created
   because "/var/data" wasn't considered as `"/" contains "/var/data"`.
   In other words, if there is (at least one) "/" in the input list,
   the initial thread dispatching must be single-threaded since every
   directory is a child of "/", meaning they all directly or indirectly
   depend on "/".

  In both cases, the first non_descendant_idx() call fails to correctly
  determine "path1-contains-path2", and as a result the initial thread
  dispatching creates another thread when it needs to be single-threaded.
  Fix a conditional in libzfs_path_contains() to consider above two.

  Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
  Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>

  PR:		237517, 237397, 239243
  Submitted by:	Matthew D. Fuller <fullermd@over-yonder.net> (by email)
  MFC after:	3 days

Changes:
  head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c
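
For readers following the fix: the corrected libzfs_path_contains() treats a
path as containing itself, and treats "/" as containing every other path,
which is exactly what the two race patterns above violated.  Below is a rough
shell rendering of the fixed predicate; the authoritative change is the C
conditional in libzfs_mount.c referenced above:

--snip--
# path_contains PARENT CHILD -- sketch of the fixed containment test:
# true if the two paths are equal, if PARENT is "/", or if CHILD lies
# under PARENT as a proper subdirectory (the original prefix+slash check).
path_contains() {
        [ "$1" = "$2" ] && return 0
        [ "$1" = "/" ] && return 0
        case $2 in "$1"/*) return 0 ;; esac
        return 1
}
--snip--

With this change, for example, path_contains / /var/data succeeds, so an
input list containing "/" keeps the initial thread dispatch single-threaded.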
Comment 7 commit-hook freebsd_committer freebsd_triage 2019-07-29 08:15:00 UTC
A commit references this bug:

Author: bapt
Date: Mon Jul 29 08:14:36 UTC 2019
New revision: 350401
URL: https://svnweb.freebsd.org/changeset/base/350401

Log:
  MFC r350358:

  Fix a bug introduced with parallel mounting of zfs

  Incorporate a fix from zol:
  https://github.com/zfsonlinux/zfs/commit/ab5036df1ccbe1b18c1ce6160b5829e8039d94ce

  (Full upstream commit message quoted in comment 6.)

  PR:  237517, 237397, 239243
  Submitted by: Matthew D. Fuller <fullermd@over-yonder.net> (by email)

Changes:
_U  stable/12/
  stable/12/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c
Comment 8 commit-hook freebsd_committer freebsd_triage 2019-07-29 08:24:13 UTC
A commit references this bug:

Author: bapt
Date: Mon Jul 29 08:23:15 UTC 2019
New revision: 350402
URL: https://svnweb.freebsd.org/changeset/base/350402

Log:
  MFC r350358:

  Fix a bug introduced with parallel mounting of zfs

  Incorporate a fix from zol:
  https://github.com/zfsonlinux/zfs/commit/ab5036df1ccbe1b18c1ce6160b5829e8039d94ce

  (Full upstream commit message quoted in comment 6.)

  PR:  237517, 237397, 239243
  Submitted by: Matthew D. Fuller <fullermd@over-yonder.net> (by email)

Changes:
_U  stable/11/
  stable/11/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c
Comment 9 Jim Long 2019-11-06 18:14:39 UTC
Based on my testing, it appears that this bug has been fixed.  If no one has objections, this PR can be closed.