Bug 242189

Summary: ZFS filesystems are not mounted from ATA on startup due to rc <-> kernel race if root is on NVMe
Product: Base System
Reporter: Oleg Sidorkin <osidorkin>
Component: misc
Assignee: Chuck Tuffli <chuck>
Status: Closed FIXED
Severity: Affects Only Me
CC: chuck
Priority: ---
Version: CURRENT
Hardware: Any
OS: Any
Attachments (description, flags):
Add root_hold_wait to zpool, zfs, zfsbe, dumpon (flags: none)
mount output (flags: none)
zfs list output (flags: none)
Add root_hold_wait to zpool and dumpon (flags: none)

Description Oleg Sidorkin 2019-11-24 13:41:04 UTC
It looks like root_hold_wait should be added to the /etc/rc.d/zfs script after r355010.

If root is on NVMe and, e.g., /var/db is on ZFS on top of ATA, then root gets mounted and /etc/rc.d/zfs is started before the ATA bus scan has completed, so the system boots with the ZFS file systems unmounted. There are root_hold_wait calls in /etc/rc.d/mountcritlocal, but ZFS file systems are not listed in fstab, so those calls do not help here.
Comment 1 Oleg Sidorkin 2020-12-01 20:39:35 UTC
Created attachment 220145 [details]
Add root_hold_wait to zpool, zfs, zfsbe, dumpon

The attached patch solves the problem for me.
Comment 2 Chuck Tuffli freebsd_committer freebsd_triage 2021-01-16 23:34:47 UTC
Would it be possible to add the output from the commands 'mount' and 'zfs list' to ensure I'm reproducing this correctly?
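For anyone else reproducing this, one quick way to spot datasets that failed to mount is to filter the scripted output of zfs list. This is a minimal sketch, assuming the input comes from `zfs list -H -o name,mounted,mountpoint` (tab-separated, no header). The function name `unmounted_datasets` is hypothetical and not part of any patch in this PR.

```sh
#!/bin/sh
# Hypothetical helper (not from the PR): read the output of
# `zfs list -H -o name,mounted,mountpoint` on stdin and print every dataset
# that has a real mountpoint but is reported as not mounted.
unmounted_datasets()
{
	awk -F '\t' '$2 == "no" && $3 != "none" && $3 != "legacy" && $3 != "-" {
		print $1 " -> " $3
	}'
}

# Assumed usage on a live system:
#   zfs list -H -o name,mounted,mountpoint | unmounted_datasets
```

Datasets that are deliberately left unmounted (for example with canmount set to off or noauto) would also show up here, so the list is a starting point rather than proof of the race.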
Comment 3 Oleg Sidorkin 2021-01-17 14:48:06 UTC
Created attachment 221677 [details]
mount output
Comment 4 Oleg Sidorkin 2021-01-17 14:48:25 UTC
Created attachment 221678 [details]
zfs list output
Comment 5 Oleg Sidorkin 2021-01-17 14:51:37 UTC
(In reply to Chuck Tuffli from comment #2)
Here they are. Without the patch, none of the ZFS file systems are mounted.
Comment 6 Chuck Tuffli freebsd_committer freebsd_triage 2021-03-03 23:06:46 UTC
I was able to reproduce this using bhyve by delaying the response to identify for 10 seconds. When I discussed this with allanjude@, he believed adding root_hold_wait to /etc/rc.d/zpool should be the only change necessary.

This setup without root_hold_wait results in ZFS not mounting the file system on the ATA drive. Adding root_hold_wait to /etc/rc.d/zpool (i.e. a subset of the OP's patch) allows ZFS to find its file system on the slower ATA drive:

nvd0: 20480MB (41943040 512 byte sectors)
WARNING: WITNESS option enabled, expect reduced performance.
Trying to mount root from ufs:/dev/gpt/rootfs [rw]...
Setting hostuuid: ebf68f0a-7861-11eb-9340-589cfc0c0ccc.
Setting hostid: 0xea7b6da7.
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
Waiting 30s for the root mount holders: CAM.........ada0 at ata2 bus 0 scbus0 target 0 lun 0
ada0: <BHYVE ATA IDE DISK 1.0> ATA-6 device
ada0: Serial Number 123456
ada0: 16.700MB/s transfers (WDMA2, PIO 65536bytes)
ada0: 51200MB (104857600 512 byte sectors)
Starting file system checks:
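
The "Waiting 30s for the root mount holders" line in the log above is printed by a bounded wait. The following is only a rough sketch of that kind of polling loop, not the rc.subr code itself. The names `list_holders` and `wait_for_holders` are hypothetical, and the sketch assumes the current holders can be read with `sysctl -n vfs.root_mount_hold`.

```sh
#!/bin/sh
# Sketch of a bounded wait for root mount holders (not the rc.subr source).
# list_holders is a hypothetical stand-in; it assumes the holders are
# exposed through the vfs.root_mount_hold sysctl.
list_holders()
{
	sysctl -n vfs.root_mount_hold 2>/dev/null
}

# wait_for_holders MAX_SECONDS
# Poll once per second until no holders remain or MAX_SECONDS have elapsed.
# Returns 0 if the holders cleared, 1 on timeout.
wait_for_holders()
{
	max=$1
	waited=0
	while :; do
		holders=$(list_holders)
		if [ -z "$holders" ]; then
			break
		fi
		if [ "$waited" -eq 0 ]; then
			printf 'Waiting %ss for the root mount holders: %s' "$max" "$holders"
		else
			printf '.'
		fi
		if [ "$waited" -ge "$max" ]; then
			echo
			return 1
		fi
		sleep 1
		waited=$((waited + 1))
	done
	# Finish the progress line only if something was printed.
	if [ "$waited" -gt 0 ]; then
		echo
	fi
	return 0
}
```

The upper bound matters: if a holder never releases, boot continues after the timeout instead of hanging.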
Comment 7 Oleg Sidorkin 2021-03-04 09:05:37 UTC
(In reply to Chuck Tuffli from comment #6)

Thanks for looking at this.

I've dropped the changes for everything except the zpool and dumpon scripts, and the ZFS file systems still got mounted on startup. So the zfs and zfsbe changes are not really needed.

Without root_hold_wait in the dumpon script, core dumps were lost because the swap device (also on the HDD) was not ready early enough.
Comment 8 Chuck Tuffli freebsd_committer freebsd_triage 2021-03-04 16:48:03 UTC
OK, I'll include the change to dumpon and post a review later tonight. Thank you for tracking this down and helping make FreeBSD better!
Comment 9 Oleg Sidorkin 2021-03-04 19:25:39 UTC
Created attachment 222983 [details]
Add root_hold_wait to zpool and dumpon

(In reply to Chuck Tuffli from comment #8)

My current version of the patch (the wait was moved to just before importing pools, so as not to slow down startup if there is nothing to import).
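
The idea of waiting only when there is something to import can be sketched roughly as follows. This is not the attached patch or the committed change. `WAIT_CMD`, `IMPORT_CMD`, and `import_cached_pools` are hypothetical names introduced so the logic can be exercised without ZFS, and the cache file paths in the usage comment are assumptions.

```sh
#!/bin/sh
# Sketch only: not the attached patch. WAIT_CMD and IMPORT_CMD are
# hypothetical hooks; on FreeBSD they would be root_hold_wait (from
# /etc/rc.subr) and `zpool import`.
: "${WAIT_CMD:=root_hold_wait}"
: "${IMPORT_CMD:=zpool import}"

# import_cached_pools FILE...
# Use the first readable cache file. The wait for slow devices happens
# only when such a file exists, so boots with nothing to import skip it.
import_cached_pools()
{
	for cachefile in "$@"; do
		if [ -r "$cachefile" ]; then
			$WAIT_CMD
			$IMPORT_CMD -c "$cachefile" -a -N
			return $?
		fi
	done
	return 0
}

# Assumed usage in an rc.d start function:
#   import_cached_pools /etc/zfs/zpool.cache /boot/zfs/zpool.cache
```

With no readable cache file the function returns without calling the wait at all, which is the point of moving it.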
Comment 10 Chuck Tuffli freebsd_committer freebsd_triage 2021-03-08 22:06:54 UTC
Review https://reviews.freebsd.org/D29101
Comment 11 commit-hook freebsd_committer freebsd_triage 2021-04-05 16:31:10 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=f30f11f878fe2aa535cd286810d31c92793a3d95

commit f30f11f878fe2aa535cd286810d31c92793a3d95
Author:     Chuck Tuffli <chuck@FreeBSD.org>
AuthorDate: 2021-03-05 16:13:23 +0000
Commit:     Chuck Tuffli <chuck@FreeBSD.org>
CommitDate: 2021-04-05 16:25:04 +0000

    wait for device mounts in zpool and dumpon

    If the root file system is composed from multiple devices, wait for
    devices to be ready before running zpool and dumpon rc scripts.

    An example of this is if the bulk of the root file system exists on a
    fast device (e.g. NVMe) but the /var directory comes from a ZFS dataset
    on a slower device (e.g. SATA). In this case, it is possible that the
    zpool import may run before the slower device has finished being probed,
    leaving the system in an intermediate state.

    Fix is to add root_hold_wait to the zpool and dumpon (which has a
    similar issue) rc scripts.

    PR:             242189
    Reported by:    osidorkin@gmail.com
    Reviewed by:    allanjude
    MFC after:      1 month
    Differential Revision:  https://reviews.freebsd.org/D29101

 libexec/rc/rc.d/dumpon | 2 ++
 libexec/rc/rc.d/zpool  | 9 ++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)
Comment 12 Oleg Sidorkin 2021-05-07 19:04:27 UTC
Should I close this pr?
Comment 13 Chuck Tuffli freebsd_committer freebsd_triage 2021-05-07 19:35:24 UTC
If you are happy with how this is working, feel free to close. I still have an outstanding "to do" to MFC this to stable/13.
Comment 14 Oleg Sidorkin 2021-05-08 22:31:23 UTC
Thanks
Comment 15 commit-hook freebsd_committer freebsd_triage 2022-11-19 17:42:12 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=d878a66a9a77fc67801000bfab8fa77b2b3faecc

commit d878a66a9a77fc67801000bfab8fa77b2b3faecc
Author:     Chuck Tuffli <chuck@FreeBSD.org>
AuthorDate: 2021-03-05 16:13:23 +0000
Commit:     Chuck Tuffli <chuck@FreeBSD.org>
CommitDate: 2022-11-19 17:40:48 +0000

    wait for device mounts in zpool and dumpon

    If the root file system is composed from multiple devices, wait for
    devices to be ready before running zpool and dumpon rc scripts.

    An example of this is if the bulk of the root file system exists on a
    fast device (e.g. NVMe) but the /var directory comes from a ZFS dataset
    on a slower device (e.g. SATA). In this case, it is possible that the
    zpool import may run before the slower device has finished being probed,
    leaving the system in an intermediate state.

    Fix is to add root_hold_wait to the zpool and dumpon (which has a
    similar issue) rc scripts.

    PR:             242189

    (cherry picked from commit f30f11f878fe2aa535cd286810d31c92793a3d95)

 libexec/rc/rc.d/dumpon | 2 ++
 libexec/rc/rc.d/zpool  | 9 ++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)