Bug 237517 - ZFS parallel mounting sometimes misses mounting intermediate filesystems
Summary: ZFS parallel mounting sometimes misses mounting intermediate filesystems
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-STABLE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-fs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-04-24 07:46 UTC by Trond.Endrestol
Modified: 2019-07-29 08:24 UTC
7 users

See Also:


Attachments
Patch for libexec/rc/rc.d/zfs making parallel mount optional (1.90 KB, patch)
2019-05-20 11:25 UTC, Trond.Endrestol
Corrected patch for libexec/rc/rc.d/zfs making parallel mount optional (1.91 KB, patch)
2019-05-20 15:11 UTC, Trond.Endrestol
Streamlined the patch for libexec/rc/rc.d/zfs making parallel mount optional (1.22 KB, patch)
2019-05-21 09:33 UTC, Trond.Endrestol
Shell script to mount all filesystems in a zpool (the pool for / by default) (1.27 KB, text/plain)
2019-05-21 22:18 UTC, Peter Eriksson
Script to parallel-mount zfs filesystems (uses "parallel") (1009 bytes, text/plain)
2019-05-21 22:19 UTC, Peter Eriksson
Hack to serialize libzfs's mounting (656 bytes, patch)
2019-06-02 21:22 UTC, fullermd
Repro script (run from /) (1.41 KB, text/plain)
2019-07-17 03:38 UTC, fullermd
ZoL fix (610 bytes, patch)
2019-07-18 02:36 UTC, fullermd

Description Trond.Endrestol 2019-04-24 07:46:24 UTC
Both Jim Long and I have experienced ZFS parallel mounting missing intermediate filesystems.

As far as I can tell, this behaviour is not deterministic: sometimes the parallel mounting succeeds and sometimes it doesn't.

See these threads, https://lists.freebsd.org/pipermail/freebsd-stable/2019-April/090898.html and https://lists.freebsd.org/pipermail/freebsd-questions/2019-April/284974.html.

Here are a couple of suggestions:

Can we make parallel mounting optional for those of us who don't need or want it?
Or, can parallel mounting be changed so that the filesystems of the root pool are completely mounted before those of all other pools?

This still leaves the possibility of the kernel attempting to mount pool/a/b before pool/a, even if both filesystems have their canmount property set to on and one is a child of the other.
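For illustration, note that a plain lexical sort of dataset names already places every parent before its children (a child name is its parent name plus "/..."), so a strictly serial mount in sorted order cannot hit the pool/a/b-before-pool/a case. A minimal sh sketch with hypothetical dataset names:

```shell
#!/bin/sh
# A byte-wise (LC_ALL=C) sort orders "pool/a" before "pool/a/b", since a
# parent's name is always a strict prefix of its children's names.
# Mounting the sorted list serially therefore always mounts a parent
# before any of its children.
sorted=$(printf '%s\n' pool/a/b pool/a pool/a/b/c | LC_ALL=C sort)
printf '%s\n' "$sorted"
# A serial mount would then be (not run here):
#   echo "$sorted" | xargs -n1 zfs mount
```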
Comment 1 Trond.Endrestol 2019-04-25 09:16:01 UTC
I created a new VM to play with. It's running the 12.0-STABLE amd64 snapshot of 2019-04-18.

Here's the output of df -ah after a normal boot:

Filesystem                              Size    Used   Avail Capacity  Mounted on
zroot/ROOT/20190418-r346338              10G    1.7G    8.5G    16%    /
devfs                                   1.0K    1.0K      0B   100%    /dev
zdata/usr/local/www                     2.6G     88K    2.6G     0%    /usr/local/www
zdata/usr/local/pgsql                   2.6G     88K    2.6G     0%    /usr/local/pgsql
zdata/home                              2.6G     88K    2.6G     0%    /home
zdata/usr/local/moodledata              2.6G     88K    2.6G     0%    /usr/local/moodledata
zdata/var/db/bacula                     2.6G     88K    2.6G     0%    /var/db/bacula
zdata/var/db/mysql                      2.6G     88K    2.6G     0%    /var/db/mysql
zdata/var/db/darkstat                   2.6G     88K    2.6G     0%    /var/db/darkstat
zdata/usr/local/pgsql/data              2.6G     88K    2.6G     0%    /usr/local/pgsql/data
zdata/var/db/boinc                      2.6G     88K    2.6G     0%    /var/db/boinc
zdata/var/db/mysql_tmpdir               2.6G     88K    2.6G     0%    /var/db/mysql_tmpdir
zdata/var/db/postgres                   2.6G     88K    2.6G     0%    /var/db/postgres
zdata/var/db/mysql_secure               2.6G     88K    2.6G     0%    /var/db/mysql_secure
zdata/usr/local/pgsql/data/base         2.6G     88K    2.6G     0%    /usr/local/pgsql/data/base
zdata/var/db/prometheus                 2.6G     88K    2.6G     0%    /var/db/prometheus
zdata/var/db/bareos                     2.6G     88K    2.6G     0%    /var/db/bareos
zdata/usr/local/pgsql/data/pg_xlog      2.6G     88K    2.6G     0%    /usr/local/pgsql/data/pg_xlog
zdata/var/db/postgres/data10            2.6G     88K    2.6G     0%    /var/db/postgres/data10
zroot/usr/compat                        8.5G     88K    8.5G     0%    /usr/compat
zdata/var/db/postgres/data11            2.6G     88K    2.6G     0%    /var/db/postgres/data11
zroot/media                             8.5G     96K    8.5G     0%    /media
zroot/var/empty                         8.5G     88K    8.5G     0%    /var/empty
zroot/usr/obj                           8.5G     88K    8.5G     0%    /usr/obj
zroot/var/audit                         8.5G     88K    8.5G     0%    /var/audit
zroot/usr/src                           8.5G     88K    8.5G     0%    /usr/src
zroot/usr/ports                         8.5G     88K    8.5G     0%    /usr/ports
zroot/var/unbound                       8.5G     88K    8.5G     0%    /var/unbound
zroot/var/crash                         8.5G     92K    8.5G     0%    /var/crash
zroot/nfs                               8.5G     88K    8.5G     0%    /nfs
zroot/var/db/ports                      8.5G     88K    8.5G     0%    /var/db/ports
zroot/var/backups                       8.5G     88K    8.5G     0%    /var/backups
zroot/var/db/fontconfig                 8.5G     88K    8.5G     0%    /var/db/fontconfig
zdata/var/db/prometheus/data            2.6G     88K    2.6G     0%    /var/db/prometheus/data
zroot/var/db/ntp                        8.5G     88K    8.5G     0%    /var/db/ntp
zroot/var/mail                          8.5G     88K    8.5G     0%    /var/mail
zroot/usr/local                         8.5G     96K    8.5G     0%    /usr/local
zroot/usr/docker                        8.5G     88K    8.5G     0%    /usr/docker
zroot/var/Named                         8.5G     88K    8.5G     0%    /var/Named
zroot/usr/local/var                     8.5G     88K    8.5G     0%    /usr/local/var
zroot/var/db/etcupdate                  8.5G    2.1M    8.5G     0%    /var/db/etcupdate
zroot/usr/local/etc                     8.5G     88K    8.5G     0%    /usr/local/etc
zroot/var/cache/ccache                  8.5G     88K    8.5G     0%    /var/cache/ccache
zroot/var/run                           8.5G    144K    8.5G     0%    /var/run
zroot/var/db/portsnap                   8.5G     88K    8.5G     0%    /var/db/portsnap
zroot/usr/local/certs                   8.5G     88K    8.5G     0%    /usr/local/certs
zroot/var/db/tlpkg                      8.5G     88K    8.5G     0%    /var/db/tlpkg
zroot/var/munin                         8.5G     88K    8.5G     0%    /var/munin
zroot/var/db/hyperv                     8.5G     88K    8.5G     0%    /var/db/hyperv
zroot/var/spool                         8.5G     96K    8.5G     0%    /var/spool
zroot/var/tmp                           8.5G     88K    8.5G     0%    /var/tmp
zroot/usr/local/tests                   8.5G     88K    8.5G     0%    /usr/local/tests
zroot/tmp                               8.5G    120K    8.5G     0%    /tmp
zroot/var/cache/synth                   8.5G     88K    8.5G     0%    /var/cache/synth
zroot/var/lib                           8.5G     88K    8.5G     0%    /var/lib
zroot/var/cache/pkg                     8.5G     88K    8.5G     0%    /var/cache/pkg
zroot/usr/local/info                    8.5G     88K    8.5G     0%    /usr/local/info
zroot/var/log                           8.5G    108K    8.5G     0%    /var/log
zroot/var/account                       8.5G     88K    8.5G     0%    /var/account
zroot/var/synth                         8.5G     88K    8.5G     0%    /var/synth
zroot/var/db/pkg                        8.5G     88K    8.5G     0%    /var/db/pkg
zdata/var/db/postgres/data10/base       2.6G     88K    2.6G     0%    /var/db/postgres/data10/base
zdata/var/db/postgres/data96            2.6G     88K    2.6G     0%    /var/db/postgres/data96
zdata/var/spool/bareos                  2.6G     88K    2.6G     0%    /var/spool/bareos
zdata/var/db/postgres/data96/pg_xlog    2.6G     88K    2.6G     0%    /var/db/postgres/data96/pg_xlog
zroot/var/spool/mqueue                  8.5G     88K    8.5G     0%    /var/spool/mqueue
zroot/var/spool/clientmqueue            8.5G     88K    8.5G     0%    /var/spool/clientmqueue
zdata/var/db/postgres/data11/base       2.6G     88K    2.6G     0%    /var/db/postgres/data11/base
zdata/var/spool/ftp                     2.6G     88K    2.6G     0%    /var/spool/ftp
zdata/var/db/postgres/data96/base       2.6G     88K    2.6G     0%    /var/db/postgres/data96/base
zroot/usr/ports/local                   8.5G     88K    8.5G     0%    /usr/ports/local
zroot/usr/ports/workdirs                8.5G     88K    8.5G     0%    /usr/ports/workdirs
zdata/var/db/postgres/data11/pg_wal     2.6G     88K    2.6G     0%    /var/db/postgres/data11/pg_wal
zroot/var/synth/builders                8.5G     88K    8.5G     0%    /var/synth/builders
zdata/var/db/prometheus/data/wal        2.6G     88K    2.6G     0%    /var/db/prometheus/data/wal
zroot/var/synth/live_packages           8.5G     88K    8.5G     0%    /var/synth/live_packages
zroot/usr/ports/distfiles               8.5G     88K    8.5G     0%    /usr/ports/distfiles
zdata/var/db/postgres/data10/pg_wal     2.6G     88K    2.6G     0%    /var/db/postgres/data10/pg_wal
zroot/usr/ports/packages                8.5G     88K    8.5G     0%    /usr/ports/packages
zroot/usr/compat/linux                  8.5G     88K    8.5G     0%    /usr/compat/linux

Some filesystems are mounted ahead of their intermediate filesystems, and the intermediate filesystems, once mounted, effectively hide their children. This cannot continue.

This is the same output after I had remounted the filesystems in the correct order:

Filesystem                              Size    Used   Avail Capacity  Mounted on
zroot/ROOT/20190418-r346338              10G    1.7G    8.5G    16%    /
devfs                                   1.0K    1.0K      0B   100%    /dev
zroot/media                             8.5G     96K    8.5G     0%    /media
zroot/nfs                               8.5G     88K    8.5G     0%    /nfs
zroot/tmp                               8.5G    116K    8.5G     0%    /tmp
zroot/usr/compat                        8.5G     88K    8.5G     0%    /usr/compat
zroot/usr/compat/linux                  8.5G     88K    8.5G     0%    /usr/compat/linux
zroot/usr/docker                        8.5G     88K    8.5G     0%    /usr/docker
zroot/usr/local                         8.5G     96K    8.5G     0%    /usr/local
zroot/usr/local/certs                   8.5G     88K    8.5G     0%    /usr/local/certs
zroot/usr/local/etc                     8.5G     88K    8.5G     0%    /usr/local/etc
zroot/usr/local/info                    8.5G     88K    8.5G     0%    /usr/local/info
zroot/usr/local/tests                   8.5G     88K    8.5G     0%    /usr/local/tests
zroot/usr/local/var                     8.5G     88K    8.5G     0%    /usr/local/var
zroot/usr/obj                           8.5G     88K    8.5G     0%    /usr/obj
zroot/usr/ports                         8.5G     88K    8.5G     0%    /usr/ports
zroot/usr/ports/distfiles               8.5G     88K    8.5G     0%    /usr/ports/distfiles
zroot/usr/ports/local                   8.5G     88K    8.5G     0%    /usr/ports/local
zroot/usr/ports/packages                8.5G     88K    8.5G     0%    /usr/ports/packages
zroot/usr/ports/workdirs                8.5G     88K    8.5G     0%    /usr/ports/workdirs
zroot/usr/src                           8.5G     88K    8.5G     0%    /usr/src
zroot/var/Named                         8.5G     88K    8.5G     0%    /var/Named
zroot/var/account                       8.5G     88K    8.5G     0%    /var/account
zroot/var/audit                         8.5G     88K    8.5G     0%    /var/audit
zroot/var/backups                       8.5G     88K    8.5G     0%    /var/backups
zroot/var/cache/ccache                  8.5G     88K    8.5G     0%    /var/cache/ccache
zroot/var/cache/pkg                     8.5G     88K    8.5G     0%    /var/cache/pkg
zroot/var/cache/synth                   8.5G     88K    8.5G     0%    /var/cache/synth
zroot/var/crash                         8.5G     92K    8.5G     0%    /var/crash
zroot/var/db/etcupdate                  8.5G    2.1M    8.5G     0%    /var/db/etcupdate
zroot/var/db/fontconfig                 8.5G     88K    8.5G     0%    /var/db/fontconfig
zroot/var/db/hyperv                     8.5G     88K    8.5G     0%    /var/db/hyperv
zroot/var/db/ntp                        8.5G     88K    8.5G     0%    /var/db/ntp
zroot/var/db/pkg                        8.5G     88K    8.5G     0%    /var/db/pkg
zroot/var/db/ports                      8.5G     88K    8.5G     0%    /var/db/ports
zroot/var/db/portsnap                   8.5G     88K    8.5G     0%    /var/db/portsnap
zroot/var/db/tlpkg                      8.5G     88K    8.5G     0%    /var/db/tlpkg
zroot/var/empty                         8.5G     88K    8.5G     0%    /var/empty
zroot/var/lib                           8.5G     88K    8.5G     0%    /var/lib
zroot/var/log                           8.5G    108K    8.5G     0%    /var/log
zroot/var/mail                          8.5G     88K    8.5G     0%    /var/mail
zroot/var/munin                         8.5G     88K    8.5G     0%    /var/munin
zroot/var/run                           8.5G    144K    8.5G     0%    /var/run
zroot/var/spool                         8.5G     96K    8.5G     0%    /var/spool
zroot/var/spool/clientmqueue            8.5G     88K    8.5G     0%    /var/spool/clientmqueue
zroot/var/spool/mqueue                  8.5G     88K    8.5G     0%    /var/spool/mqueue
zroot/var/synth                         8.5G     88K    8.5G     0%    /var/synth
zroot/var/synth/builders                8.5G     88K    8.5G     0%    /var/synth/builders
zroot/var/synth/live_packages           8.5G     88K    8.5G     0%    /var/synth/live_packages
zroot/var/tmp                           8.5G     88K    8.5G     0%    /var/tmp
zroot/var/unbound                       8.5G     88K    8.5G     0%    /var/unbound
zdata/home                              2.6G     88K    2.6G     0%    /home
zdata/usr/local/moodledata              2.6G     88K    2.6G     0%    /usr/local/moodledata
zdata/usr/local/pgsql                   2.6G     88K    2.6G     0%    /usr/local/pgsql
zdata/usr/local/pgsql/data              2.6G     88K    2.6G     0%    /usr/local/pgsql/data
zdata/usr/local/pgsql/data/base         2.6G     88K    2.6G     0%    /usr/local/pgsql/data/base
zdata/usr/local/pgsql/data/pg_xlog      2.6G     88K    2.6G     0%    /usr/local/pgsql/data/pg_xlog
zdata/usr/local/www                     2.6G     88K    2.6G     0%    /usr/local/www
zdata/var/db/bacula                     2.6G     88K    2.6G     0%    /var/db/bacula
zdata/var/db/bareos                     2.6G     88K    2.6G     0%    /var/db/bareos
zdata/var/db/boinc                      2.6G     88K    2.6G     0%    /var/db/boinc
zdata/var/db/darkstat                   2.6G     88K    2.6G     0%    /var/db/darkstat
zdata/var/db/mysql                      2.6G     88K    2.6G     0%    /var/db/mysql
zdata/var/db/mysql_secure               2.6G     88K    2.6G     0%    /var/db/mysql_secure
zdata/var/db/mysql_tmpdir               2.6G     88K    2.6G     0%    /var/db/mysql_tmpdir
zdata/var/db/postgres                   2.6G     88K    2.6G     0%    /var/db/postgres
zdata/var/db/postgres/data10            2.6G     88K    2.6G     0%    /var/db/postgres/data10
zdata/var/db/postgres/data10/base       2.6G     88K    2.6G     0%    /var/db/postgres/data10/base
zdata/var/db/postgres/data10/pg_wal     2.6G     88K    2.6G     0%    /var/db/postgres/data10/pg_wal
zdata/var/db/postgres/data11            2.6G     88K    2.6G     0%    /var/db/postgres/data11
zdata/var/db/postgres/data11/base       2.6G     88K    2.6G     0%    /var/db/postgres/data11/base
zdata/var/db/postgres/data11/pg_wal     2.6G     88K    2.6G     0%    /var/db/postgres/data11/pg_wal
zdata/var/db/postgres/data96            2.6G     88K    2.6G     0%    /var/db/postgres/data96
zdata/var/db/postgres/data96/base       2.6G     88K    2.6G     0%    /var/db/postgres/data96/base
zdata/var/db/postgres/data96/pg_xlog    2.6G     88K    2.6G     0%    /var/db/postgres/data96/pg_xlog
zdata/var/db/prometheus                 2.6G     88K    2.6G     0%    /var/db/prometheus
zdata/var/db/prometheus/data            2.6G     88K    2.6G     0%    /var/db/prometheus/data
zdata/var/db/prometheus/data/wal        2.6G     88K    2.6G     0%    /var/db/prometheus/data/wal
zdata/var/spool/bareos                  2.6G     88K    2.6G     0%    /var/spool/bareos
zdata/var/spool/ftp                     2.6G     88K    2.6G     0%    /var/spool/ftp
Comment 2 Trond.Endrestol 2019-05-20 11:25:46 UTC
Created attachment 204479 [details]
Patch for libexec/rc/rc.d/zfs making parallel mount optional

Here's a half-baked solution to my dilemma.
Comment 3 Trond.Endrestol 2019-05-20 15:11:34 UTC
Created attachment 204484 [details]
Corrected patch for libexec/rc/rc.d/zfs making parallel mount optional

Added the missing ].
Comment 4 Trond.Endrestol 2019-05-21 09:33:09 UTC
Created attachment 204504 [details]
Streamlined the patch for libexec/rc/rc.d/zfs making parallel mount optional

Filtering in the right order is key: first eliminate filesystems with no valid mountpoint, then eliminate filesystems with canmount=off. We're left with a simple on/off switch, defaulting to off.
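The filter described above can be expressed as a small sh function; a sketch, assuming input columns matching `zfs list -H -o name,canmount,mountpoint` (the function name is made up for illustration):

```shell
#!/bin/sh
# filter_mountable: read "name<TAB>canmount<TAB>mountpoint" lines and
# print, in order, only the datasets that should actually be mounted.
# First drop entries with no valid mountpoint (legacy, none, -), then
# keep only canmount=on (this also skips canmount=noauto).
filter_mountable() {
	while IFS="$(printf '\t')" read -r name canmount mnt; do
		case "$mnt" in
			/*) ;;		# a real mountpoint
			*) continue ;;	# legacy, none, or -
		esac
		[ "$canmount" = "on" ] || continue
		printf '%s\n' "$name"
	done
}
# Real use would be (not run here):
#   zfs list -H -o name,canmount,mountpoint -t filesystem | filter_mountable
```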
Comment 5 Peter Eriksson 2019-05-21 22:17:17 UTC
We do something similar on our servers - however, the main reason we do it is to speed things up. Normally a standard "zfs mount -a" would take slightly less than 1 hour to execute (we have a lot of ZFS filesystems), which was a bit of a pain when we needed to reboot the servers (or they crashed during office hours).

So we've been investigating various ways of speeding things up. One hack was to run a large number of individual "zfs mount $FS" commands in parallel, but then you need to sort the filesystems so they get mounted in the right order... We also do the bulk of the filesystem mounting in the background so we at least get a login prompt and can log in to the machine and see what is happening... The hack was surprisingly effective: 5-10 minutes instead of 1 hour. :-)

The scripts can probably be improved a lot, but they do the job for us. It would have been nice to have a "zfs mount --recursive POOL" command. Or a "zfs mount --parallel=20 -a" command. :-)


From our /etc/rc.d/zfs:

zfs_start_main()
{
        /sbin/zfs-speedmount > /var/log/zfs/zfs-speedmount.log 2>&1
        zfs share -a
        touch /etc/zfs/exports
}

...

zfs_start()
{
        if [ `$SYSCTL_N security.jail.jailed` -eq 1 ]; then
                zfs_start_jail
        else
                # Empty exports file, to prevent annoying error messages while mounting
                echo -n "" > /etc/zfs/exports

                # Mount all other root pool filesystems first
                /sbin/zfs-mountrpool -v

                # And then do the rest in the background
                zfs_start_main &
        fi
}

(I'll add our scripts as attachments in case they might be useful to someone.)
Comment 6 Peter Eriksson 2019-05-21 22:18:24 UTC
Created attachment 204524 [details]
Shell script to mount all filesystems in a zpool (the pool for / by default)
Comment 7 Peter Eriksson 2019-05-21 22:19:26 UTC
Created attachment 204525 [details]
Script to parallel-mount zfs filesystems (uses "parallel")
Comment 8 fullermd 2019-06-02 21:21:30 UTC
This bit me on a server recently: it was upgraded across a few months of stable/12, then came up with little things like /usr/local masked away.  Adjusting libzfs to effectively serialize the mounting turned out to be my best solution.
Comment 9 fullermd 2019-06-02 21:22:43 UTC
Created attachment 204788 [details]
Hack to serialize libzfs's mounting

Apply and rebuild/install src/cddl/libzfs/
Comment 10 Ben Morrow 2019-07-01 14:01:44 UTC
You can turn the parallel mounting off by setting ZFS_MOUNT_SERIAL=1 in the environment when you run either `zfs mount -a` or `zfs import` (without -N). IMHO rc.d/zfs should do this until the code is fixed properly.

I had a brief look at the code and it's trying to do the right thing; I can't yet see why it's not working as it should. Does anyone know if the same problem exists on other OSes?
Comment 11 fullermd 2019-07-17 03:25:52 UTC
(In reply to Ben Morrow from comment #10)

For future readers: that's ZFS_SERIAL_MOUNT (not ZFS_MOUNT_SERIAL, which I spent a while fruitlessly trying to find ;).  It's not a 1/0 value; only its presence is checked.  It _does_ seem to work around things, but apparently you can't set it in $zfs_env in rc.conf; you have to hand-edit it into the rc.d/zfs script.  But doing so does work around it.
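The hand-edit amounts to something like the following fragment (a sketch, not the exact diff; the line to touch may vary by branch, and any value works since only the variable's presence is checked):

```shell
# In zfs_start_main() in rc.d/zfs, prefix the mount with the
# environment variable to force the serial code path:
ZFS_SERIAL_MOUNT=1 zfs mount -va
```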
Comment 12 fullermd 2019-07-17 03:37:42 UTC
Now, bad news: I upgraded another system from a late-Dec stable/12 to last night's stable/12, and it also came up broken.  Whee.

But, good news: I've managed to track down how to reproduce it!  It has to do with having multiple ZFS filesystem trees mounted under the same point.  This apparently happened on several of my systems because I have pool/usr and pool/usr/local (mounted in the places you'd expect) on a lot of them, but also pool/bsd/src and pool/bsd/obj mounted on /usr/{src,obj}.

So I grabbed the FreeBSD-12.0-STABLE-amd64-20190711-r349903.raw.xz snapshot and booted it up in bhyve.  I created a separate ~100 meg scratch zvol and attached it as well, as the backing disk for a test zpool (two -d's to /usr/share/examples/vmrun.sh) creatively named 'ztst'.

And then I worked up the repro script I'm attaching.  Something like half the time, it "succeeds" in failing.  It seems that the way in which the zfs create's write stuff out is what causes the variability.  When the failure shows up in a run, repeated `service zfs restart`'s will (almost?) always continue to show the failure; in runs where everything shows up fine the first time, it will continue to show up fine.

With any luck, this should let somebody familiar with the code quickly pinpoint the cause of the issue.  I assume you can just run the script 'till it fails, then start instrumenting the crap out of zfs_foreach_mountpoint().


(Obviously, you don't want to try running this on a ZFS-root system, or Bad Things will probably happen the first time it runs "zfs unmount -a"!)


An example "broken" run:

# /try2_zfs.sh
Created:
/ztst/t2fs/afile
/ztst/t2fs/somedir/afile
/ztst/t2fs/somedir/anotherdir/afile
/ztst/t2fs/somedir/subdir/afile

Restarted:
/ztst/t2fs/afile
/ztst/t2fs/somedir/afile

Restarted serially:
/ztst/t2fs/afile
/ztst/t2fs/somedir/afile
/ztst/t2fs/somedir/anotherdir/afile
/ztst/t2fs/somedir/subdir/afile
Comment 13 fullermd 2019-07-17 03:38:44 UTC
Created attachment 205830 [details]
Repro script (run from /)

Be sure to run it on a scratch zpool, on a system not otherwise relying on ZFS!
Comment 14 fullermd 2019-07-18 02:35:41 UTC
Well, I spent some time looking at it.  The source of the issue winds up being how the filesystems are sorted and fed into the parallel mounting.  Because the list winds up being "/x/y/z /x/y/z /x/y/z/a [...]", the first entry finds out that the second is not one of its descendants, so it assumes it has nothing else to do in that branch.  Meanwhile, the higher level also skips it in weird ways.  I'm a little unsure why it's not deterministic, but...

Anyway, I worked up a patch that seems like it should fix it, and in testing I'm no longer able to reproduce it.  Then I made the mistake of going over to look at ZoL, to see if they'd seen anything similar and had near-enough code for it to apply there.  And it turns out they actually applied a slightly different (but I think semantically equivalent) fix, for what looks like slightly different symptoms, just last week.  With their patch, I'm also unable to reproduce it.

So, see https://github.com/zfsonlinux/zfs/commit/ab5036df1ccbe1b18c1ce6160b5829e8039d94ce#diff-13df3dd8690ce1edf20baa129b4e9608
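The functional change hinges on the containment test described above; sketched here in sh for illustration (the real fix is a conditional in libzfs_path_contains(), in C; the function name below is just a stand-in):

```shell
#!/bin/sh
# path_contains PARENT CHILD: succeed when CHILD depends on PARENT and
# must therefore be mounted after it.  The fix covers the first two
# cases: identical paths (the "/a /a /a/b" input), and "/" containing
# every other path.
path_contains() {
	[ "$1" = "$2" ] && return 0	# same path counts as contained
	[ "$1" = "/" ] && return 0	# "/" contains every path
	case "$2" in
		"$1"/*) return 0 ;;	# prefix match on a "/" boundary
	esac
	return 1
}
```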
Comment 15 fullermd 2019-07-18 02:36:55 UTC
Created attachment 205857 [details]
ZoL fix

This is the functional bit of the ZoL fix, against stable/12 (as of the last snapshot date, which I used to test).  From a quick glance, head doesn't have any other nearby changes, so it presumably applies cleanly to both branches.
Comment 16 Tomoaki AOKI 2019-07-21 23:51:58 UTC
(In reply to fullermd from comment #15)

Thanks. This seems to fix my problem.

My problem was that datasets on pools imported manually (after boot)
usually weren't mounted properly: they were not accessible until a manual
unmount / mount was done.
Note that it rarely succeeded on import, so there must be some race.
This is why I cannot clearly state FIXED.
I would need several months to be sure, as I had only 1 or 2 successes after the initial commit of parallel mounting.

BTW, can this change be committed? I'm afraid there could be a licensing problem (GPL pollution), as this doesn't seem to be in Illumos currently.
If this change can be (dual-)licensed under the CDDL (or BSD), it would be OK.
Comment 17 Bertrand Petit 2019-07-22 15:48:59 UTC
I confirm that attachment 205857 [details] fixes the problem I reported separately at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239243 for 11.3-STABLE.
Comment 18 commit-hook freebsd_committer 2019-07-26 13:12:40 UTC
A commit references this bug:

Author: bapt
Date: Fri Jul 26 13:12:33 UTC 2019
New revision: 350358
URL: https://svnweb.freebsd.org/changeset/base/350358

Log:
  Fix a bug introduced with parallel mounting of zfs

  Incorporate a fix from zol:
  https://github.com/zfsonlinux/zfs/commit/ab5036df1ccbe1b18c1ce6160b5829e8039d94ce

  commit log from upstream:
   Fix race in parallel mount's thread dispatching algorithm

  Strategy of parallel mount is as follows.

  1) Initial thread dispatching is to select sets of mount points that
   don't have dependencies on other sets, hence threads can/should run
   lock-less and shouldn't race with other threads for other sets. Each
   thread dispatched corresponds to top level directory which may or may
   not have datasets to be mounted on sub directories.

  2) Subsequent recursive thread dispatching for each thread from 1)
   is to mount datasets for each set of mount points. The mount points
   within each set have dependencies (i.e. child directories), so child
   directories are processed only after parent directory completes.

  The problem is that the initial thread dispatching in
  zfs_foreach_mountpoint() can be multi-threaded when it needs to be
  single-threaded, and this puts threads under race condition. This race
  appeared as mount/unmount issues on ZoL for ZoL having different
  timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
  `zfs unmount -a` which expects proper mount order can't unmount if the
  mounts were reordered by the race condition.

  There are currently two known patterns of input list `handles` in
  `zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

  1) #8833 case where input is `/a /a /a/b` after sorting.
   The problem is that libzfs_path_contains() can't correctly handle an
   input list with two same top level directories.
   There is a race between two POSIX threads A and B,
    * ThreadA for "/a" for test1 and "/a/b"
    * ThreadB for "/a" for test0/a
   and in case of #8833, ThreadA won the race. Two threads were created
   because "/a" wasn't considered as `"/a" contains "/a"`.

  2) #8450 case where input is `/ /var/data /var/data/test` after sorting.
   The problem is that libzfs_path_contains() can't correctly handle an
   input list containing "/".
   There is a race between two POSIX threads A and B,
    * ThreadA for "/" and "/var/data/test"
    * ThreadB for "/var/data"
   and in case of #8450, ThreadA won the race. Two threads were created
   because "/var/data" wasn't considered as `"/" contains "/var/data"`.
   In other words, if there is (at least one) "/" in the input list,
   the initial thread dispatching must be single-threaded since every
   directory is a child of "/", meaning they all directly or indirectly
   depend on "/".

  In both cases, the first non_descendant_idx() call fails to correctly
  determine "path1-contains-path2", and as a result the initial thread
  dispatching creates another thread when it needs to be single-threaded.
  Fix a conditional in libzfs_path_contains() to consider above two.

  Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
  Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>

  PR:		237517, 237397, 239243
  Submitted by:	Matthew D. Fuller <fullermd@over-yonder.net> (by email)
  MFC after:	3 days

Changes:
  head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c
Comment 19 commit-hook freebsd_committer 2019-07-29 08:15:03 UTC
A commit references this bug:

Author: bapt
Date: Mon Jul 29 08:14:36 UTC 2019
New revision: 350401
URL: https://svnweb.freebsd.org/changeset/base/350401

Log:
  MFC r350358:

  Fix a bug introduced with parallel mounting of zfs

  Incorporate a fix from zol:
  https://github.com/zfsonlinux/zfs/commit/ab5036df1ccbe1b18c1ce6160b5829e8039d94ce

  commit log from upstream:
   Fix race in parallel mount's thread dispatching algorithm

  Strategy of parallel mount is as follows.

  1) Initial thread dispatching is to select sets of mount points that
   don't have dependencies on other sets, hence threads can/should run
   lock-less and shouldn't race with other threads for other sets. Each
   thread dispatched corresponds to top level directory which may or may
   not have datasets to be mounted on sub directories.

  2) Subsequent recursive thread dispatching for each thread from 1)
   is to mount datasets for each set of mount points. The mount points
   within each set have dependencies (i.e. child directories), so child
   directories are processed only after parent directory completes.

  The problem is that the initial thread dispatching in
  zfs_foreach_mountpoint() can be multi-threaded when it needs to be
  single-threaded, and this puts threads under race condition. This race
  appeared as mount/unmount issues on ZoL for ZoL having different
  timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
  `zfs unmount -a` which expects proper mount order can't unmount if the
  mounts were reordered by the race condition.

  There are currently two known patterns of input list `handles` in
  `zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

  1) #8833 case where input is `/a /a /a/b` after sorting.
   The problem is that libzfs_path_contains() can't correctly handle an
   input list with two same top level directories.
   There is a race between two POSIX threads A and B,
    * ThreadA for "/a" for test1 and "/a/b"
    * ThreadB for "/a" for test0/a
   and in case of #8833, ThreadA won the race. Two threads were created
   because "/a" wasn't considered as `"/a" contains "/a"`.

  2) #8450 case where input is `/ /var/data /var/data/test` after sorting.
   The problem is that libzfs_path_contains() can't correctly handle an
   input list containing "/".
   There is a race between two POSIX threads A and B,
    * ThreadA for "/" and "/var/data/test"
    * ThreadB for "/var/data"
   and in case of #8450, ThreadA won the race. Two threads were created
   because "/var/data" wasn't considered as `"/" contains "/var/data"`.
   In other words, if there is (at least one) "/" in the input list,
   the initial thread dispatching must be single-threaded since every
   directory is a child of "/", meaning they all directly or indirectly
   depend on "/".

  In both cases, the first non_descendant_idx() call fails to correctly
  determine "path1-contains-path2", and as a result the initial thread
  dispatching creates another thread when it needs to be single-threaded.
  Fix the conditional in libzfs_path_contains() to handle the above two
  cases.

  Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
  Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>

  PR:  237517, 237397, 239243
  Submitted by: Matthew D. Fuller <fullermd@over-yonder.net> (by email)

Changes:
_U  stable/12/
  stable/12/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c
Comment 20 commit-hook freebsd_committer 2019-07-29 08:24:10 UTC
A commit references this bug:

Author: bapt
Date: Mon Jul 29 08:23:15 UTC 2019
New revision: 350402
URL: https://svnweb.freebsd.org/changeset/base/350402

Log:
  MFC r350358:

  Fix a bug introduced with parallel mounting of zfs

  Incorporate a fix from zol:
  https://github.com/zfsonlinux/zfs/commit/ab5036df1ccbe1b18c1ce6160b5829e8039d94ce

  commit log from upstream:
   Fix race in parallel mount's thread dispatching algorithm

  Strategy of parallel mount is as follows.

  1) Initial thread dispatching is to select sets of mount points that
   don't have dependencies on other sets, hence threads can/should run
   lock-less and shouldn't race with other threads for other sets. Each
   thread dispatched corresponds to top level directory which may or may
   not have datasets to be mounted on sub directories.

  2) Subsequent recursive thread dispatching for each thread from 1)
   is to mount datasets for each set of mount points. The mount points
   within each set have dependencies (i.e. child directories), so child
   directories are processed only after parent directory completes.

  The problem is that the initial thread dispatching in
  zfs_foreach_mountpoint() can be multi-threaded when it needs to be
  single-threaded, which puts the threads in a race condition. This race
  surfaced as mount/unmount issues on ZoL, since ZoL's mount(2) timing
  differs due to the fork(2)/exec(2) of mount(8).
  `zfs unmount -a`, which expects proper mount order, can't unmount if
  the mounts were reordered by the race condition.

  There are currently two known patterns of input list `handles` in
  `zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

  1) #8833 case where input is `/a /a /a/b` after sorting.
   The problem is that libzfs_path_contains() can't correctly handle an
   input list with two identical top level directories.
   There is a race between two POSIX threads A and B,
    * ThreadA for "/a" for test1 and "/a/b"
    * ThreadB for "/a" for test0/a
   and in case of #8833, ThreadA won the race. Two threads were created
   because "/a" wasn't considered as `"/a" contains "/a"`.

  2) #8450 case where input is `/ /var/data /var/data/test` after sorting.
   The problem is that libzfs_path_contains() can't correctly handle an
   input list containing "/".
   There is a race between two POSIX threads A and B,
    * ThreadA for "/" and "/var/data/test"
    * ThreadB for "/var/data"
   and in case of #8450, ThreadA won the race. Two threads were created
   because "/var/data" wasn't considered as `"/" contains "/var/data"`.
   In other words, if there is (at least one) "/" in the input list,
   the initial thread dispatching must be single-threaded since every
   directory is a child of "/", meaning they all directly or indirectly
   depend on "/".

  In both cases, the first non_descendant_idx() call fails to correctly
  determine "path1-contains-path2", and as a result the initial thread
  dispatching creates another thread when it needs to be single-threaded.
  Fix the conditional in libzfs_path_contains() to handle the above two
  cases.

  Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
  Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>

  PR:  237517, 237397, 239243
  Submitted by: Matthew D. Fuller <fullermd@over-yonder.net> (by email)

Changes:
_U  stable/11/
  stable/11/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c