Both Jim Long and I have seen ZFS parallel mounting fail to mount intermediate filesystems. As far as I can tell, the behaviour is not deterministic; sometimes the parallel mounting succeeds and sometimes it doesn't. See these threads: https://lists.freebsd.org/pipermail/freebsd-stable/2019-April/090898.html and https://lists.freebsd.org/pipermail/freebsd-questions/2019-April/284974.html. Here are a couple of suggestions: Can we make parallel mounting optional for those of us who don't need or want it? Or, can parallel mounting be changed so that the root pool's filesystems are completely mounted before all other pools? That still leaves the possibility of the kernel attempting to mount pool/a/b before pool/a, even when both filesystems have canmount=on and one is a child of the other.
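To make the failure mode concrete, here's a minimal illustration with throwaway names (pool, /a and /a/b are placeholders, not my actual layout): when the child is mounted before its parent, the parent's own (empty) copy of the directory ends up covering the child's mount.

  # both datasets have canmount=on; the child inherits its mountpoint
  zfs create -o mountpoint=/a pool/a
  zfs create pool/a/b            # mounted at /a/b
  zfs umount -a
  zfs mount pool/a/b             # child mounted first, as the race can do at boot
  zfs mount -O pool/a            # parent now covers /a (-O forces the overlay here;
                                 # the boot-time race leaves you in the same state)
  ls /a/b                        # shows pool/a's empty "b" directory, not pool/a/b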
I created a new VM to play with. It's running the 12.0-STABLE amd64 snapshot of 2019-04-18. Here's the output of df -ah after a normal boot:

Filesystem Size Used Avail Capacity Mounted on
zroot/ROOT/20190418-r346338 10G 1.7G 8.5G 16% /
devfs 1.0K 1.0K 0B 100% /dev
zdata/usr/local/www 2.6G 88K 2.6G 0% /usr/local/www
zdata/usr/local/pgsql 2.6G 88K 2.6G 0% /usr/local/pgsql
zdata/home 2.6G 88K 2.6G 0% /home
zdata/usr/local/moodledata 2.6G 88K 2.6G 0% /usr/local/moodledata
zdata/var/db/bacula 2.6G 88K 2.6G 0% /var/db/bacula
zdata/var/db/mysql 2.6G 88K 2.6G 0% /var/db/mysql
zdata/var/db/darkstat 2.6G 88K 2.6G 0% /var/db/darkstat
zdata/usr/local/pgsql/data 2.6G 88K 2.6G 0% /usr/local/pgsql/data
zdata/var/db/boinc 2.6G 88K 2.6G 0% /var/db/boinc
zdata/var/db/mysql_tmpdir 2.6G 88K 2.6G 0% /var/db/mysql_tmpdir
zdata/var/db/postgres 2.6G 88K 2.6G 0% /var/db/postgres
zdata/var/db/mysql_secure 2.6G 88K 2.6G 0% /var/db/mysql_secure
zdata/usr/local/pgsql/data/base 2.6G 88K 2.6G 0% /usr/local/pgsql/data/base
zdata/var/db/prometheus 2.6G 88K 2.6G 0% /var/db/prometheus
zdata/var/db/bareos 2.6G 88K 2.6G 0% /var/db/bareos
zdata/usr/local/pgsql/data/pg_xlog 2.6G 88K 2.6G 0% /usr/local/pgsql/data/pg_xlog
zdata/var/db/postgres/data10 2.6G 88K 2.6G 0% /var/db/postgres/data10
zroot/usr/compat 8.5G 88K 8.5G 0% /usr/compat
zdata/var/db/postgres/data11 2.6G 88K 2.6G 0% /var/db/postgres/data11
zroot/media 8.5G 96K 8.5G 0% /media
zroot/var/empty 8.5G 88K 8.5G 0% /var/empty
zroot/usr/obj 8.5G 88K 8.5G 0% /usr/obj
zroot/var/audit 8.5G 88K 8.5G 0% /var/audit
zroot/usr/src 8.5G 88K 8.5G 0% /usr/src
zroot/usr/ports 8.5G 88K 8.5G 0% /usr/ports
zroot/var/unbound 8.5G 88K 8.5G 0% /var/unbound
zroot/var/crash 8.5G 92K 8.5G 0% /var/crash
zroot/nfs 8.5G 88K 8.5G 0% /nfs
zroot/var/db/ports 8.5G 88K 8.5G 0% /var/db/ports
zroot/var/backups 8.5G 88K 8.5G 0% /var/backups
zroot/var/db/fontconfig 8.5G 88K 8.5G 0% /var/db/fontconfig
zdata/var/db/prometheus/data 2.6G 88K 2.6G 0% /var/db/prometheus/data
zroot/var/db/ntp 8.5G 88K 8.5G 0% /var/db/ntp
zroot/var/mail 8.5G 88K 8.5G 0% /var/mail
zroot/usr/local 8.5G 96K 8.5G 0% /usr/local
zroot/usr/docker 8.5G 88K 8.5G 0% /usr/docker
zroot/var/Named 8.5G 88K 8.5G 0% /var/Named
zroot/usr/local/var 8.5G 88K 8.5G 0% /usr/local/var
zroot/var/db/etcupdate 8.5G 2.1M 8.5G 0% /var/db/etcupdate
zroot/usr/local/etc 8.5G 88K 8.5G 0% /usr/local/etc
zroot/var/cache/ccache 8.5G 88K 8.5G 0% /var/cache/ccache
zroot/var/run 8.5G 144K 8.5G 0% /var/run
zroot/var/db/portsnap 8.5G 88K 8.5G 0% /var/db/portsnap
zroot/usr/local/certs 8.5G 88K 8.5G 0% /usr/local/certs
zroot/var/db/tlpkg 8.5G 88K 8.5G 0% /var/db/tlpkg
zroot/var/munin 8.5G 88K 8.5G 0% /var/munin
zroot/var/db/hyperv 8.5G 88K 8.5G 0% /var/db/hyperv
zroot/var/spool 8.5G 96K 8.5G 0% /var/spool
zroot/var/tmp 8.5G 88K 8.5G 0% /var/tmp
zroot/usr/local/tests 8.5G 88K 8.5G 0% /usr/local/tests
zroot/tmp 8.5G 120K 8.5G 0% /tmp
zroot/var/cache/synth 8.5G 88K 8.5G 0% /var/cache/synth
zroot/var/lib 8.5G 88K 8.5G 0% /var/lib
zroot/var/cache/pkg 8.5G 88K 8.5G 0% /var/cache/pkg
zroot/usr/local/info 8.5G 88K 8.5G 0% /usr/local/info
zroot/var/log 8.5G 108K 8.5G 0% /var/log
zroot/var/account 8.5G 88K 8.5G 0% /var/account
zroot/var/synth 8.5G 88K 8.5G 0% /var/synth
zroot/var/db/pkg 8.5G 88K 8.5G 0% /var/db/pkg
zdata/var/db/postgres/data10/base 2.6G 88K 2.6G 0% /var/db/postgres/data10/base
zdata/var/db/postgres/data96 2.6G 88K 2.6G 0% /var/db/postgres/data96
zdata/var/spool/bareos 2.6G 88K 2.6G 0% /var/spool/bareos
zdata/var/db/postgres/data96/pg_xlog 2.6G 88K 2.6G 0% /var/db/postgres/data96/pg_xlog
zroot/var/spool/mqueue 8.5G 88K 8.5G 0% /var/spool/mqueue
zroot/var/spool/clientmqueue 8.5G 88K 8.5G 0% /var/spool/clientmqueue
zdata/var/db/postgres/data11/base 2.6G 88K 2.6G 0% /var/db/postgres/data11/base
zdata/var/spool/ftp 2.6G 88K 2.6G 0% /var/spool/ftp
zdata/var/db/postgres/data96/base 2.6G 88K 2.6G 0% /var/db/postgres/data96/base
zroot/usr/ports/local 8.5G 88K 8.5G 0% /usr/ports/local
zroot/usr/ports/workdirs 8.5G 88K 8.5G 0% /usr/ports/workdirs
zdata/var/db/postgres/data11/pg_wal 2.6G 88K 2.6G 0% /var/db/postgres/data11/pg_wal
zroot/var/synth/builders 8.5G 88K 8.5G 0% /var/synth/builders
zdata/var/db/prometheus/data/wal 2.6G 88K 2.6G 0% /var/db/prometheus/data/wal
zroot/var/synth/live_packages 8.5G 88K 8.5G 0% /var/synth/live_packages
zroot/usr/ports/distfiles 8.5G 88K 8.5G 0% /usr/ports/distfiles
zdata/var/db/postgres/data10/pg_wal 2.6G 88K 2.6G 0% /var/db/postgres/data10/pg_wal
zroot/usr/ports/packages 8.5G 88K 8.5G 0% /usr/ports/packages
zroot/usr/compat/linux 8.5G 88K 8.5G 0% /usr/compat/linux

Some filesystems are mounted ahead of their intermediate filesystems, and once an intermediate filesystem is mounted it effectively hides its children. This cannot continue. This is the same output after I remounted the filesystems in the correct order:

Filesystem Size Used Avail Capacity Mounted on
zroot/ROOT/20190418-r346338 10G 1.7G 8.5G 16% /
devfs 1.0K 1.0K 0B 100% /dev
zroot/media 8.5G 96K 8.5G 0% /media
zroot/nfs 8.5G 88K 8.5G 0% /nfs
zroot/tmp 8.5G 116K 8.5G 0% /tmp
zroot/usr/compat 8.5G 88K 8.5G 0% /usr/compat
zroot/usr/compat/linux 8.5G 88K 8.5G 0% /usr/compat/linux
zroot/usr/docker 8.5G 88K 8.5G 0% /usr/docker
zroot/usr/local 8.5G 96K 8.5G 0% /usr/local
zroot/usr/local/certs 8.5G 88K 8.5G 0% /usr/local/certs
zroot/usr/local/etc 8.5G 88K 8.5G 0% /usr/local/etc
zroot/usr/local/info 8.5G 88K 8.5G 0% /usr/local/info
zroot/usr/local/tests 8.5G 88K 8.5G 0% /usr/local/tests
zroot/usr/local/var 8.5G 88K 8.5G 0% /usr/local/var
zroot/usr/obj 8.5G 88K 8.5G 0% /usr/obj
zroot/usr/ports 8.5G 88K 8.5G 0% /usr/ports
zroot/usr/ports/distfiles 8.5G 88K 8.5G 0% /usr/ports/distfiles
zroot/usr/ports/local 8.5G 88K 8.5G 0% /usr/ports/local
zroot/usr/ports/packages 8.5G 88K 8.5G 0% /usr/ports/packages
zroot/usr/ports/workdirs 8.5G 88K 8.5G 0% /usr/ports/workdirs
zroot/usr/src 8.5G 88K 8.5G 0% /usr/src
zroot/var/Named 8.5G 88K 8.5G 0% /var/Named
zroot/var/account 8.5G 88K 8.5G 0% /var/account
zroot/var/audit 8.5G 88K 8.5G 0% /var/audit
zroot/var/backups 8.5G 88K 8.5G 0% /var/backups
zroot/var/cache/ccache 8.5G 88K 8.5G 0% /var/cache/ccache
zroot/var/cache/pkg 8.5G 88K 8.5G 0% /var/cache/pkg
zroot/var/cache/synth 8.5G 88K 8.5G 0% /var/cache/synth
zroot/var/crash 8.5G 92K 8.5G 0% /var/crash
zroot/var/db/etcupdate 8.5G 2.1M 8.5G 0% /var/db/etcupdate
zroot/var/db/fontconfig 8.5G 88K 8.5G 0% /var/db/fontconfig
zroot/var/db/hyperv 8.5G 88K 8.5G 0% /var/db/hyperv
zroot/var/db/ntp 8.5G 88K 8.5G 0% /var/db/ntp
zroot/var/db/pkg 8.5G 88K 8.5G 0% /var/db/pkg
zroot/var/db/ports 8.5G 88K 8.5G 0% /var/db/ports
zroot/var/db/portsnap 8.5G 88K 8.5G 0% /var/db/portsnap
zroot/var/db/tlpkg 8.5G 88K 8.5G 0% /var/db/tlpkg
zroot/var/empty 8.5G 88K 8.5G 0% /var/empty
zroot/var/lib 8.5G 88K 8.5G 0% /var/lib
zroot/var/log 8.5G 108K 8.5G 0% /var/log
zroot/var/mail 8.5G 88K 8.5G 0% /var/mail
zroot/var/munin 8.5G 88K 8.5G 0% /var/munin
zroot/var/run 8.5G 144K 8.5G 0% /var/run
zroot/var/spool 8.5G 96K 8.5G 0% /var/spool
zroot/var/spool/clientmqueue 8.5G 88K 8.5G 0% /var/spool/clientmqueue
zroot/var/spool/mqueue 8.5G 88K 8.5G 0% /var/spool/mqueue
zroot/var/synth 8.5G 88K 8.5G 0% /var/synth
zroot/var/synth/builders 8.5G 88K 8.5G 0% /var/synth/builders
zroot/var/synth/live_packages 8.5G 88K 8.5G 0% /var/synth/live_packages
zroot/var/tmp 8.5G 88K 8.5G 0% /var/tmp
zroot/var/unbound 8.5G 88K 8.5G 0% /var/unbound
zdata/home 2.6G 88K 2.6G 0% /home
zdata/usr/local/moodledata 2.6G 88K 2.6G 0% /usr/local/moodledata
zdata/usr/local/pgsql 2.6G 88K 2.6G 0% /usr/local/pgsql
zdata/usr/local/pgsql/data 2.6G 88K 2.6G 0% /usr/local/pgsql/data
zdata/usr/local/pgsql/data/base 2.6G 88K 2.6G 0% /usr/local/pgsql/data/base
zdata/usr/local/pgsql/data/pg_xlog 2.6G 88K 2.6G 0% /usr/local/pgsql/data/pg_xlog
zdata/usr/local/www 2.6G 88K 2.6G 0% /usr/local/www
zdata/var/db/bacula 2.6G 88K 2.6G 0% /var/db/bacula
zdata/var/db/bareos 2.6G 88K 2.6G 0% /var/db/bareos
zdata/var/db/boinc 2.6G 88K 2.6G 0% /var/db/boinc
zdata/var/db/darkstat 2.6G 88K 2.6G 0% /var/db/darkstat
zdata/var/db/mysql 2.6G 88K 2.6G 0% /var/db/mysql
zdata/var/db/mysql_secure 2.6G 88K 2.6G 0% /var/db/mysql_secure
zdata/var/db/mysql_tmpdir 2.6G 88K 2.6G 0% /var/db/mysql_tmpdir
zdata/var/db/postgres 2.6G 88K 2.6G 0% /var/db/postgres
zdata/var/db/postgres/data10 2.6G 88K 2.6G 0% /var/db/postgres/data10
zdata/var/db/postgres/data10/base 2.6G 88K 2.6G 0% /var/db/postgres/data10/base
zdata/var/db/postgres/data10/pg_wal 2.6G 88K 2.6G 0% /var/db/postgres/data10/pg_wal
zdata/var/db/postgres/data11 2.6G 88K 2.6G 0% /var/db/postgres/data11
zdata/var/db/postgres/data11/base 2.6G 88K 2.6G 0% /var/db/postgres/data11/base
zdata/var/db/postgres/data11/pg_wal 2.6G 88K 2.6G 0% /var/db/postgres/data11/pg_wal
zdata/var/db/postgres/data96 2.6G 88K 2.6G 0% /var/db/postgres/data96
zdata/var/db/postgres/data96/base 2.6G 88K 2.6G 0% /var/db/postgres/data96/base
zdata/var/db/postgres/data96/pg_xlog 2.6G 88K 2.6G 0% /var/db/postgres/data96/pg_xlog
zdata/var/db/prometheus 2.6G 88K 2.6G 0% /var/db/prometheus
zdata/var/db/prometheus/data 2.6G 88K 2.6G 0% /var/db/prometheus/data
zdata/var/db/prometheus/data/wal 2.6G 88K 2.6G 0% /var/db/prometheus/data/wal
zdata/var/spool/bareos 2.6G 88K 2.6G 0% /var/spool/bareos
zdata/var/spool/ftp 2.6G 88K 2.6G 0% /var/spool/ftp
Created attachment 204479 [details] Patch for libexec/rc/rc.d/zfs making parallel mount optional Here's a half-baked solution to my dilemma.
Created attachment 204484 [details] Corrected patch for libexec/rc/rc.d/zfs making parallel mount optional Added the missing ].
Created attachment 204504 [details] Streamlined the patch for libexec/rc/rc.d/zfs making parallel mount optional Filtering in the right order is key. First eliminate filesystems with no valid mountpoints, next eliminate filesystems having canmount=off. We're left with a simple on-off switch, defaulting to off.
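The idea, roughly (a minimal sketch with made-up names, not the patch itself): list the datasets, throw away the ones that can't or shouldn't be mounted, and mount the rest one at a time in mountpoint order, so a parent is always mounted before its children.

  # sort by mountpoint so a parent path always sorts before its children
  zfs list -H -t filesystem -o mountpoint,canmount,name -s mountpoint |
      while read _mp _canmount _fs; do
          case "${_mp}" in
          /*) ;;                  # keep real paths
          *)  continue ;;         # skip none/legacy/-
          esac
          [ "${_canmount}" = "on" ] || continue    # skip canmount=off/noauto
          zfs mount "${_fs}" || echo "mount of ${_fs} failed" >&2
      done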
We do something similar on our servers - however the main reason we do it is to speed things up. Normally a standard "zfs mount -a" would take slightly less than 1 hour to execute (a lot of ZFS filesystems), which was a bit of a pain when we needed to reboot the servers (or they crashed during office hours). So we've been investigating various ways of speeding things up.

A hack was to run a large number of individual "zfs mount $FS" commands in parallel, but then you need to sort the filesystems so they get mounted in the right order... We also do the bulk of the filesystem mounting in the background so we at least get a login prompt and can log in to the machine and see what is happening... The hack was surprisingly effective - 5-10 minutes instead of 1 hour. :-)

The scripts can very probably be improved a lot but they do the job for us. Would have been nice to have a "zfs mount --recursive POOL" command. Or a "zfs mount --parallel=20 -a" command. :-)

From our /etc/rc.d/zfs:

zfs_start_main()
{
    /sbin/zfs-speedmount > /var/log/zfs/zfs-speedmount.log 2>&1
    zfs share -a
    touch /etc/zfs/exports
}

...

zfs_start()
{
    if [ `$SYSCTL_N security.jail.jailed` -eq 1 ]; then
        zfs_start_jail
    else
        # Empty exports-file, to prevent annoying error messages while mounting
        echo -n "" > /etc/zfs/exports

        # Mount all other root pool filesystems first
        /sbin/zfs-mountrpool -v

        # And then we do the rest in the background
        zfs_start_main &
    fi
}

(I'll add our scripts as attachments in case they might be useful to someone).
Created attachment 204524 [details] Shell script to mount all filesystems in a zpool (the pool for / by default)
Created attachment 204525 [details] Script to parallel-mount zfs filesystems (uses "parallel")
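For the archives, the general shape of the approach is something like this (a simplified sketch, not the script itself; it assumes sysutils/parallel is installed and that the root-pool filesystems are already mounted): mount one mountpoint depth at a time, parallelizing within each depth level, so a parent is always in place before its children.

  #!/bin/sh
  # Sketch only: mount depth level by depth level, 20 mounts at a time.
  for depth in $(jot 10 1); do          # 10 levels is an arbitrary cap
      zfs list -H -t filesystem -o canmount,mountpoint,name |
          awk -v d="$depth" '$1 == "on" && $2 ~ /^\/./ {
              # depth = number of path components in the mountpoint
              if (split($2, parts, "/") - 1 == d + 0) print $3
          }' |
          parallel -j 20 zfs mount {}
  done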
This bit me on a server recently: it had been upgraded across a few months of stable/12 and came up with little things like /usr/local masked away. Adjusting libzfs to effectively serialize the mounting turned out to be my best solution.
Created attachment 204788 [details] Hack to serialize libzfs's mounting Apply and rebuild/install src/cddl/libzfs/
You can turn the parallel mounting off by setting ZFS_MOUNT_SERIAL=1 in the environment when you run either `zfs mount -a` or `zfs import` (without -N). IMHO rc.d/zfs should do this until the code is fixed properly. I had a brief look at the code and it's trying to do the right thing; I can't yet see why it's not working as it should. Does anyone know if the same problem exists on other OSes?
(In reply to Ben Morrow from comment #10) For future readers: that's ZFS_SERIAL_MOUNT (not ZFS_MOUNT_SERIAL, which I spent a while fruitlessly trying to find ;). It's not a 1/0 switch either; the variable only has to be present in the environment. It _does_ seem to work around things, but apparently you can't set it via $zfs_env in rc.conf; you have to hand-edit it into the rc.d/zfs script. But doing so does work around it.
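In case it saves the next person some time, the hand-edit amounts to something like the following near the top of /etc/rc.d/zfs (or at the start of zfs_start()); the value itself doesn't matter, only that the variable is exported:

  ZFS_SERIAL_MOUNT=yes
  export ZFS_SERIAL_MOUNT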
Now, bad news: I upgraded another system from a late-Dec stable/12 to last night's stable/12, and it also came up broken. Whee. But, good news: I've managed to track down how to reproduce it!

It has to do with having multiple ZFS FS trees mounted on the same point. This apparently happened on several of my systems because I have a lot of pool/usr and pool/usr/local (mounted in the places you'd expect), but also pool/bsd/src and pool/bsd/obj mounted on /usr/{src,obj}.

So I grabbed the FreeBSD-12.0-STABLE-amd64-20190711-r349903.raw.xz snapshot and booted it up in bhyve. I created a separate ~100 meg scratch zvol and also attached it in, as the backing disk for a test zpool (two -d's to /usr/share/examples/vmrun.sh) creatively named 'ztst'. And then I worked up the repro script I'm attaching.

Something like half the time, it "succeeds" in failing. It seems that the way the zfs creates write stuff out is what causes the variability. When a run shows the failure, repeated `service zfs restart`s will (almost?) always continue to show the failure, but in runs where everything shows up fine the first time, they'll continue to show up fine.

With any luck, this should let somebody familiar with the code quickly pinpoint the cause of the issues. I assume you can just run the script 'till it fails, then start instrumenting the crap out of zfs_foreach_mountpoint(). (Obviously, you don't want to try running this on a ZFS root system, or Bad Things will probably happen the first time it runs "zfs unmount -a"!)

An example "broken" run:

# /try2_zfs.sh
Created: /ztst/t2fs/afile /ztst/t2fs/somedir/afile /ztst/t2fs/somedir/anotherdir/afile /ztst/t2fs/somedir/subdir/afile
Restarted: /ztst/t2fs/afile /ztst/t2fs/somedir/afile
Restarted serially: /ztst/t2fs/afile /ztst/t2fs/somedir/afile /ztst/t2fs/somedir/anotherdir/afile /ztst/t2fs/somedir/subdir/afile
Created attachment 205830 [details] Repro script (run from /) Be sure to run it on a scratch zpool, on a system not otherwise relying on ZFS!
Well, I spent some time looking at it. The source of the issue winds up being how the FS's are sorted and fed into the parallel mounting. Because the list winds up being "/x/y/z /x/y/z /x/y/z/a [...]", the thread that picks up the first /x/y/z finds that the second one is not one of its descendants, so it assumes it has nothing else to do in that branch. Meanwhile, the higher level also skips it in weird ways. I'm a little unsure why it's not deterministic, but...

Anyway, I worked up a patch that seems like it should fix it, and in testing I'm no longer able to reproduce it. Then I made the mistake of going over to look at ZoL, to see if they'd seen anything similar and had near-enough code for it to apply there. And it turns out they actually applied a slightly different (but I think semantically equivalent) fix, for what looks like slightly different symptoms, just last week. With their patch, I'm also unable to reproduce it. So, see https://github.com/zfsonlinux/zfs/commit/ab5036df1ccbe1b18c1ce6160b5829e8039d94ce#diff-13df3dd8690ce1edf20baa129b4e9608
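For concreteness, the kind of layout that produces a list like that is simply two datasets sharing one mountpoint plus a child below it (hypothetical names, not necessarily the repro's layout):

  # after sorting, the mountpoint list handed to zfs_foreach_mountpoint()
  # comes out as "/x/y/z /x/y/z /x/y/z/a"
  zfs set mountpoint=/x/y/z pool/first
  zfs set mountpoint=/x/y/z pool/second
  zfs create pool/second/a          # inherits mountpoint /x/y/z/a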
Created attachment 205857 [details] ZoL fix This is the functional bit of the ZoL fix, versus stable/12 (as of the last snapshot date, which I used to test). From a quick glance, head doesn't have any other nearby changes, so it presumably applies cleanly to both branches.
(In reply to fullermd from comment #15) Thanks. This seems to fix my problem. My problem was that datasets on pools manually imported after boot usually weren't mounted properly; they weren't accessible until a manual unmount/mount was done for them. Note that it rarely succeeded on import, so there must be some race. This is why I cannot clearly state FIXED; I would need several months to be sure, as I only had 1 or 2 successes after the initial parallel-mounting commit went in. BTW, can this change be committed as-is? I'm afraid there could be a licensing problem (GPL pollution), as this doesn't seem to be in Illumos currently. If this change can be (dual-)licensed under CDDL (or BSD), it would be OK.
I confirm that attachment 205857 [details] fixes the problem I reported separately at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239243 for 11.3-STABLE.
A commit references this bug:

Author: bapt
Date: Fri Jul 26 13:12:33 UTC 2019
New revision: 350358
URL: https://svnweb.freebsd.org/changeset/base/350358

Log:
  Fix a bug introduced with parallel mounting of zfs

  Incorporate a fix from zol:
  https://github.com/zfsonlinux/zfs/commit/ab5036df1ccbe1b18c1ce6160b5829e8039d94ce

  commit log from upstream:

    Fix race in parallel mount's thread dispatching algorithm

    Strategy of parallel mount is as follows.

    1) Initial thread dispatching is to select sets of mount points that
    don't have dependencies on other sets, hence threads can/should run
    lock-less and shouldn't race with other threads for other sets. Each
    thread dispatched corresponds to top level directory which may or may
    not have datasets to be mounted on sub directories.

    2) Subsequent recursive thread dispatching for each thread from 1) is
    to mount datasets for each set of mount points. The mount points within
    each set have dependencies (i.e. child directories), so child
    directories are processed only after parent directory completes.

    The problem is that the initial thread dispatching in
    zfs_foreach_mountpoint() can be multi-threaded when it needs to be
    single-threaded, and this puts threads under race condition. This race
    appeared as mount/unmount issues on ZoL for ZoL having different timing
    regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
    `zfs unmount -a` which expects proper mount order can't unmount if the
    mounts were reordered by the race condition.

    There are currently two known patterns of input list `handles` in
    `zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

    1) #8833 case where input is `/a /a /a/b` after sorting.
    The problem is that libzfs_path_contains() can't correctly handle an
    input list with two same top level directories.
    There is a race between two POSIX threads A and B,
     * ThreadA for "/a" for test1 and "/a/b"
     * ThreadB for "/a" for test0/a
    and in case of #8833, ThreadA won the race. Two threads were created
    because "/a" wasn't considered as `"/a" contains "/a"`.

    2) #8450 case where input is `/ /var/data /var/data/test` after sorting.
    The problem is that libzfs_path_contains() can't correctly handle an
    input list containing "/".
    There is a race between two POSIX threads A and B,
     * ThreadA for "/" and "/var/data/test"
     * ThreadB for "/var/data"
    and in case of #8450, ThreadA won the race. Two threads were created
    because "/var/data" wasn't considered as `"/" contains "/var/data"`.

    In other words, if there is (at least one) "/" in the input list, the
    initial thread dispatching must be single-threaded since every directory
    is a child of "/", meaning they all directly or indirectly depend on "/".

    In both cases, the first non_descendant_idx() call fails to correctly
    determine "path1-contains-path2", and as a result the initial thread
    dispatching creates another thread when it needs to be single-threaded.
    Fix a conditional in libzfs_path_contains() to consider above two.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
    Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>

  PR:           237517, 237397, 239243
  Submitted by: Matthew D. Fuller <fullermd@over-yonder.net> (by email)
  MFC after:    3 days

Changes:
  head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c
A commit references this bug:

Author: bapt
Date: Mon Jul 29 08:14:36 UTC 2019
New revision: 350401
URL: https://svnweb.freebsd.org/changeset/base/350401

Log:
  MFC r350358:

  Fix a bug introduced with parallel mounting of zfs

  Incorporate a fix from zol:
  https://github.com/zfsonlinux/zfs/commit/ab5036df1ccbe1b18c1ce6160b5829e8039d94ce

  commit log from upstream:

    Fix race in parallel mount's thread dispatching algorithm

    Strategy of parallel mount is as follows.

    1) Initial thread dispatching is to select sets of mount points that
    don't have dependencies on other sets, hence threads can/should run
    lock-less and shouldn't race with other threads for other sets. Each
    thread dispatched corresponds to top level directory which may or may
    not have datasets to be mounted on sub directories.

    2) Subsequent recursive thread dispatching for each thread from 1) is
    to mount datasets for each set of mount points. The mount points within
    each set have dependencies (i.e. child directories), so child
    directories are processed only after parent directory completes.

    The problem is that the initial thread dispatching in
    zfs_foreach_mountpoint() can be multi-threaded when it needs to be
    single-threaded, and this puts threads under race condition. This race
    appeared as mount/unmount issues on ZoL for ZoL having different timing
    regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
    `zfs unmount -a` which expects proper mount order can't unmount if the
    mounts were reordered by the race condition.

    There are currently two known patterns of input list `handles` in
    `zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

    1) #8833 case where input is `/a /a /a/b` after sorting.
    The problem is that libzfs_path_contains() can't correctly handle an
    input list with two same top level directories.
    There is a race between two POSIX threads A and B,
     * ThreadA for "/a" for test1 and "/a/b"
     * ThreadB for "/a" for test0/a
    and in case of #8833, ThreadA won the race. Two threads were created
    because "/a" wasn't considered as `"/a" contains "/a"`.

    2) #8450 case where input is `/ /var/data /var/data/test` after sorting.
    The problem is that libzfs_path_contains() can't correctly handle an
    input list containing "/".
    There is a race between two POSIX threads A and B,
     * ThreadA for "/" and "/var/data/test"
     * ThreadB for "/var/data"
    and in case of #8450, ThreadA won the race. Two threads were created
    because "/var/data" wasn't considered as `"/" contains "/var/data"`.

    In other words, if there is (at least one) "/" in the input list, the
    initial thread dispatching must be single-threaded since every directory
    is a child of "/", meaning they all directly or indirectly depend on "/".

    In both cases, the first non_descendant_idx() call fails to correctly
    determine "path1-contains-path2", and as a result the initial thread
    dispatching creates another thread when it needs to be single-threaded.
    Fix a conditional in libzfs_path_contains() to consider above two.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
    Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>

  PR:           237517, 237397, 239243
  Submitted by: Matthew D. Fuller <fullermd@over-yonder.net> (by email)

Changes:
  _U  stable/12/
  stable/12/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c
A commit references this bug:

Author: bapt
Date: Mon Jul 29 08:23:15 UTC 2019
New revision: 350402
URL: https://svnweb.freebsd.org/changeset/base/350402

Log:
  MFC r350358:

  Fix a bug introduced with parallel mounting of zfs

  Incorporate a fix from zol:
  https://github.com/zfsonlinux/zfs/commit/ab5036df1ccbe1b18c1ce6160b5829e8039d94ce

  commit log from upstream:

    Fix race in parallel mount's thread dispatching algorithm

    Strategy of parallel mount is as follows.

    1) Initial thread dispatching is to select sets of mount points that
    don't have dependencies on other sets, hence threads can/should run
    lock-less and shouldn't race with other threads for other sets. Each
    thread dispatched corresponds to top level directory which may or may
    not have datasets to be mounted on sub directories.

    2) Subsequent recursive thread dispatching for each thread from 1) is
    to mount datasets for each set of mount points. The mount points within
    each set have dependencies (i.e. child directories), so child
    directories are processed only after parent directory completes.

    The problem is that the initial thread dispatching in
    zfs_foreach_mountpoint() can be multi-threaded when it needs to be
    single-threaded, and this puts threads under race condition. This race
    appeared as mount/unmount issues on ZoL for ZoL having different timing
    regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
    `zfs unmount -a` which expects proper mount order can't unmount if the
    mounts were reordered by the race condition.

    There are currently two known patterns of input list `handles` in
    `zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

    1) #8833 case where input is `/a /a /a/b` after sorting.
    The problem is that libzfs_path_contains() can't correctly handle an
    input list with two same top level directories.
    There is a race between two POSIX threads A and B,
     * ThreadA for "/a" for test1 and "/a/b"
     * ThreadB for "/a" for test0/a
    and in case of #8833, ThreadA won the race. Two threads were created
    because "/a" wasn't considered as `"/a" contains "/a"`.

    2) #8450 case where input is `/ /var/data /var/data/test` after sorting.
    The problem is that libzfs_path_contains() can't correctly handle an
    input list containing "/".
    There is a race between two POSIX threads A and B,
     * ThreadA for "/" and "/var/data/test"
     * ThreadB for "/var/data"
    and in case of #8450, ThreadA won the race. Two threads were created
    because "/var/data" wasn't considered as `"/" contains "/var/data"`.

    In other words, if there is (at least one) "/" in the input list, the
    initial thread dispatching must be single-threaded since every directory
    is a child of "/", meaning they all directly or indirectly depend on "/".

    In both cases, the first non_descendant_idx() call fails to correctly
    determine "path1-contains-path2", and as a result the initial thread
    dispatching creates another thread when it needs to be single-threaded.
    Fix a conditional in libzfs_path_contains() to consider above two.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
    Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>

  PR:           237517, 237397, 239243
  Submitted by: Matthew D. Fuller <fullermd@over-yonder.net> (by email)

Changes:
  _U  stable/11/
  stable/11/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c
I vote for closing this PR: there was a genuine bug in the ZFS code, which is now fixed, and you can export the environment variable ZFS_SERIAL_MOUNT if you want the old, slow behaviour.
There was a genuine bug in OpenZFS, which by now was fixed a long time ago. I am marking this PR as closed and fixed.