Bug 251610 - rc.d/zpool runs before ada(4) attaches
Summary: rc.d/zpool runs before ada(4) attaches
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 13.0-STABLE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-rc (Nobody)
URL: https://github.com/freebsd/freebsd-sr...
Keywords: needs-qa
Depends on:
Blocks:
 
Reported: 2020-12-05 17:41 UTC by Harald Schmalzbauer
Modified: 2024-09-17 07:29 UTC
CC List: 3 users

See Also:


Attachments
extend rc.d/dumpon to check and delay for cam(4) probing (1.63 KB, patch)
2020-12-05 17:41 UTC, Harald Schmalzbauer
extend rc.d/dumpon to check and delay for cam(4) probing (1.67 KB, patch)
2021-02-15 21:21 UTC, Harald Schmalzbauer
Boot log excerpt (2.15 KB, text/plain)
2021-12-22 10:03 UTC, Hauke Fath

Description Harald Schmalzbauer 2020-12-05 17:41:18 UTC
Created attachment 220284
extend rc.d/dumpon to check and delay for cam(4) probing

On systems whose root is a memory filesystem (and probably also on systems with an nvd(4)-backed root), rc.d/zpool doesn't work: at the time it runs, cam(4) is still in the (aprobe) state for the attached SSDs (850 Pro in my test case).
I can imagine much slower disks taking even longer to attach, leading to similar problems in the not-so-rare cases where root is not mounted from md(4).

This affects dumpon and swapon too, which might have even higher impact than zpool!

The attached diff checks the output of camcontrol(8), which lives in /sbin on the root filesystem, so it does not need anything from the /usr filesystem.

The check runs unconditionally, but the overhead should be negligible on systems that are not affected: it is inexpensive and only runs once at startup.
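
For readers without the attachment at hand, the idea of checking camcontrol(8) output and delaying can be sketched roughly as follows. This is not the attached patch; it is a minimal illustration under some assumptions: the function name wait_for_cam_probe, the DEVLIST_CMD override, and the 30-try limit are all hypothetical, and it assumes that peripherals still being probed appear in the device list as "(aprobeN)" (or "(probeN)" for SCSI).

```shell
#!/bin/sh
# Hypothetical sketch, NOT the attached patch.
# DEVLIST_CMD is an assumed override hook so the logic can be exercised
# without real hardware; by default it is "camcontrol devlist".
: "${DEVLIST_CMD:=camcontrol devlist}"

# wait_for_cam_probe [tries]
# Poll once per second until no peripheral is listed as still probing.
# Returns 0 when nothing is probing (or the listing is empty),
# 1 if probing is still in progress after the given number of tries.
wait_for_cam_probe()
{
	_tries=${1:-30}		# assumed default, not taken from the patch
	while [ "$_tries" -gt 0 ]; do
		# Assumption: probing peripherals show up as "(aprobe0)" / "(probe0)".
		if ! $DEVLIST_CMD 2>/dev/null | grep -Eq '[(,]a?probe[0-9]'; then
			return 0
		fi
		sleep 1
		_tries=$((_tries - 1))
	done
	return 1
}
```

A script such as rc.d/zpool could call wait_for_cam_probe before importing pools and continue regardless of the result, so that an unaffected system pays for a single camcontrol invocation and an affected one waits at most the configured number of seconds.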
Comment 1 Harald Schmalzbauer 2021-02-15 21:21:22 UTC
Created attachment 222477
extend rc.d/dumpon to check and delay for cam(4) probing

I accidentally found this unresolved PR and noticed that the patch is outdated.
Here's what enables me to use OpenZFS in production.
Comment 2 Hauke Fath 2021-12-22 10:03:17 UTC
Created attachment 230308
Boot log excerpt
Comment 3 Hauke Fath 2021-12-22 10:18:32 UTC
Came here to say this.

In the process of setting up a fileserver, I noticed that a 10 TB pool replicated from the old omniosce machine wouldn't mount during boot, while it had no problem when imported manually.

My observations and comments:

The (rather misleading) error message from /etc/rc.d/zpool is frequently interspersed with kernel autoconfiguration messages (see attachment). This made it harder than necessary to figure out what was going wrong, together with my disbelief that something so string-and-ducttape-y should be shipped in a release.

Why does the kernel boot multi-user before it's done with autoconfiguration? And if parallelizing operations is the idea, why isn't there a barrier in place for things as vital as disk operations?

And it doesn't end there: downstream, mountd is utterly confused by a list of mounts in /etc/zfs/exports that don't exist ("mountd[7977]: bad exports list line 'redacted': symbolic link in export path or statfs failed"). Why is this information persisted, instead of being created under /var/run during the zpool import (or not created at all, if the pool isn't found)?

In the end, I inserted a 20 sec sleep in /etc/rc.d/zpool, and moved on, rather unimpressed.
Comment 4 Hauke Fath 2021-12-22 15:21:04 UTC
In the same vein, there is this: <https://www.reddit.com/r/freebsd/comments/n0rxud/zfs_loads_before_disks_are_ready_during_boot_then/>
Comment 5 Graham Perrin freebsd_committer freebsd_triage 2022-10-17 12:40:23 UTC
Keyword: 

    patch
or  patch-ready

– in lieu of summary line prefix: 

    [patch]

* bulk change for the keyword
* summary lines may be edited manually (not in bulk). 

Keyword descriptions and search interface: 

    <https://bugs.freebsd.org/bugzilla/describekeywords.cgi>
Comment 6 Graham Perrin 2023-10-01 16:47:29 UTC
(In reply to Harald Schmalzbauer from comment #0)

> … probably also true for nvd(4) backed root systems, …

Is the situation significantly different with 15.0-CURRENT? 

<https://cgit.freebsd.org/src/log/?qt=grep&q=nvd>, a few things catch my eye.


(In reply to Harald Schmalzbauer from comment #1)

% git -C /usr/src pull --ff-only freebsd main
From https://git.freebsd.org/src
 * branch                      main       -> FETCH_HEAD
Already up to date.
% git -C /usr/src apply --check --verbose /tmp/222477.patch
Checking patch libexec/rc/rc.d/dumpon...
Hunk #1 succeeded at 47 (offset 12 lines).
% 


(In reply to Hauke Fath from comment #4)

Re: <https://old.reddit.com/r/freebsd/comments/n0rxud/-/gwbx5up/> for reference only, 

% sysctl -d kern.cam.boot_delay kern.cam.scsi_delay
kern.cam.boot_delay: Bus registration wait time
kern.cam.scsi_delay: Delay to allow devices to settle after a SCSI bus reset (ms)
%
Comment 7 Oleg Sidorkin 2024-07-04 21:00:39 UTC
Probably, it's a duplicate of https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=242189
Comment 8 Harald Schmalzbauer 2024-09-17 07:29:21 UTC
This PR is obsolete, see d878a66a9a, PR: 242189
Closing