Bug 263407 - RC3 Guided ZFS on root with encryption unmountable
Summary: RC3 Guided ZFS on root with encryption unmountable
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: misc
Version: 13.1-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Kyle Evans
URL: https://reviews.freebsd.org/D35019
Keywords:
Duplicates: 263343
Depends on:
Blocks:
 
Reported: 2022-04-18 22:55 UTC by Daniel Briggs
Modified: 2022-04-28 21:45 UTC
CC List: 13 users

See Also:
Flags: kevans: mfc-stable13+
       kevans: mfc-stable12+


Attachments
photograph (208.27 KB, image/jpeg)
2022-04-19 15:00 UTC, Graham Perrin
no flags Details
frame from a recording (538.99 KB, image/png)
2022-04-19 15:18 UTC, Graham Perrin
no flags Details

Description Daniel Briggs 2022-04-18 22:55:15 UTC
I attempted a fresh install of 13.1-RC3 from the amd64 memstick image, using the "Auto (ZFS) Guided Root-on-ZFS" option with "Encrypt Disk" set to "YES". After the installation, however, the system fails to mount the root filesystem. The error reported is:

Solaris: NOTICE: Cannot find the pool label for 'zroot'
Mounting from zfs:zroot/ROOT/default failed with error 5.

This is followed by the mountroot prompt. I do not encounter this issue if the "Encrypt Disk" feature is turned off.

Someone else appears to have encountered this issue in March on BETA2 and mailed one of the lists, but I did not see either a follow-up or a bug reported here.
https://lists.freebsd.org/archives/freebsd-stable/2022-March/000623.html
Comment 1 Graham Perrin freebsd_committer freebsd_triage 2022-04-19 01:30:35 UTC
Maybe related: 

main-n254654-d4e8207317c results in "no pools available to import"
<https://lists.freebsd.org/archives/freebsd-current/2022-April/001754.html>

<https://markmail.org/message/xua4rc3qtprkqiwu> and (threading broken) 
<https://markmail.org/message/77js65vrngjgumco>
Comment 2 Graham Perrin freebsd_committer freebsd_triage 2022-04-19 15:00:21 UTC
Created attachment 233334 [details]
photograph

(In reply to Graham Perrin from comment #1)

… on second thoughts, maybe _not_ related, although note the comment from Tom (Thomas) Laus at <https://lists.freebsd.org/archives/freebsd-current/2022-April/001824.html>.

(In reply to Daniel Briggs from comment #0)

Reproducible with    FreeBSD-13.1-RC3-amd64-memstick.img 
on an HP ProBook 450 G8.

Reproducible with    FreeBSD-13.1-RC3-amd64-disc1.iso 
on the same computer, photograph attached.
Comment 3 Graham Perrin freebsd_committer freebsd_triage 2022-04-19 15:18:07 UTC
Created attachment 233335 [details]
frame from a recording

(In reply to Graham Perrin from comment #2)

Transcribed by hand from a frame of a recording (apologies for any typos): 

----

Loading kernel...
…
Loading configured modules...
/boot/kernel/geom_eli.ko
elf64_obj_loadimage: read section headers failed

can't load file '/boot/kernel/geom_eli.ko': input/output error
/boot/kernel/cryptodev.ko size 0xa158 at 0x2131000
can't find '/etc/hostid'
/boot/kernel/zfs.ko size 0x5b93a0 at 0x213c000
/boot/entropy size=0x1000
staging 0x2b000000 (not copying) tramp 0x33b3d000 PT4 0x339ef000
Start @ 0xffffffff80385000 ...
EFI framebuffer information: 
…
Comment 4 freebsd 2022-04-20 07:38:03 UTC
I am not using any GELI partitions; I got the issue on 14-CURRENT of 20220407 (latest commit 0c348b97eb0) on a Framework laptop (c. 2022-02; Intel i5-1135G7; NVMe disk; UEFI only).

There was (and is) no issue with "gpart show", and no issue importing and exporting the root pool in "Live CD" mode; I went back to an earlier working boot environment of 20220402 (main-n254246-6833ac673d9).
Comment 5 Mason Loring Bliss freebsd_triage 2022-04-20 15:43:23 UTC
As a datapoint, I'd noticed this recently, in my case using my "ZFS and
GELI by hand" guide as the install method. On reading this bug (thanks
parv) I ran a test install on a VM (libvirt), letting the installer
partition for me, and the issue was there. I confirmed that it's also 
there in RC1, and then performed some more comprehensive testing.

I fished out all the ISOs I have locally, as the older ones don't seem to 
be available from freebsd.org any more, at least where I looked. I ran
most tests on a VM (the legacy results) and on a couple of pieces of
hardware (both UEFI results).

Tests run so far:

STABLE-20211111 UEFI and legacy, boots with and without GELI
BETA1 UEFI and legacy, boots with and without GELI
BETA2 legacy, boots with and without GELI
BETA3 legacy, boots with and without GELI
RC1 legacy, GELI fails to boot
RC2 legacy, GELI fails to boot
RC3 legacy, GELI fails to boot

So, this issue seems to have emerged between BETA3 and RC1.
Comment 6 Glen Barber freebsd_committer freebsd_triage 2022-04-20 16:49:03 UTC
Martin,

Is it possible this was introduced with 99c3f15774595c1163ecec87aa8cb157ccdc7d96?

That is the only ZFS-related change between BETA3 and RC1.
Comment 7 Martin Matuska freebsd_committer freebsd_triage 2022-04-20 21:13:00 UTC
I have started debugging, and here are my findings:

There are no errors with 13.1-BETA3. But this doesn't mean it is "fine".

When doing a 13.1-RC3 install with ZFS+GELI, the loader fails to load /boot/kernel/geom_eli.ko with the following error message:
elf64_obj_loadimage: read section headers failed

As a result, geom_eli.ko is not loaded, and ZFS is of course unable to find the "zroot" pool when trying to mount root, because the GELI device does not exist.

The file has the correct checksum, but the loader cannot cope with it, for an unknown reason. Other modules load fine.

I have found this workaround:
If you exit to a shell after running the automated installer (before rebooting) and do the following:

cp /boot/kernel/geom_eli.ko /tmp
rm /boot/kernel/geom_eli.ko
cp /tmp/geom_eli.ko /boot/kernel/geom_eli.ko

and then reboot, the system boots fine.
Is this a problem in the ZFS support of our loader (not being able to read a specific file out of ZFS) or elsewhere?

In BETA3, geom_eli.ko might simply be located at a different position in the root pool, which is why the loader reads it correctly.
Comment 8 Mason Loring Bliss freebsd_triage 2022-04-20 21:58:24 UTC
Martin,

Just to confirm, kevans had me capture some output and what I see matches:

Loading configured modules...^M
|^H/^H-^H\^H|^H/boot/kernel/zfs.ko /^H-^Hsize 0x5b93a0 at 0x2131000^M
\^H|^H/boot/kernel/geom_eli.ko ^M
elf64_obj_loadimage: read section headers failed^M
^M
/^H-^H/boot/kernel/geom_eli.ko ^M
elf64_obj_loadimage: read section headers failed^M
^M
can't load file '/boot/kernel/geom_eli.ko': input/output error^M
\^H|^H/boot/entropy size=0x1000^M
/^H-^H\^H|^H/boot/kernel/cryptodev.ko size 0xa158 at 0x26ec000^M
/^H-^Hcan't find '/etc/hostid'^M
\^H---<<BOOT>>---^M
Comment 9 Mason Loring Bliss freebsd_triage 2022-04-20 22:08:24 UTC
As further confirmation, your workaround works, and previously unbootable
systems now boot. Bizarre stuff.
Comment 10 Kyle Evans freebsd_committer freebsd_triage 2022-04-21 02:02:57 UTC
(In reply to Martin Matuska from comment #7)

To summarize some of what I've discovered:

The loader fails to read it because we process the first datablk and then still have 8 bytes left to read. Note the zdb output:

obj=14 dataset=zroot/ROOT/default path=/boot type=20 bonustype=44
obj=77 dataset=zroot/ROOT/default path=/boot/kernel type=20 bonustype=44
obj=16532 dataset=zroot/ROOT/default path=/boot/kernel/geom_eli.ko type=19 bonustype=44

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
     16532    1   128K   128K    72K     512   128K  100.00  ZFS plain file (K=inherit) (Z=inherit=lz4)
                                               168   bonus  System attributes
	dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
	dnode maxblkid: 0
	uid     0
	gid     0
	atime	Wed Apr 20 16:36:09 2022
	mtime	Thu Apr 14 05:24:57 2022
	ctime	Wed Apr 20 16:36:13 2022
	crtime	Thu Apr 14 05:24:57 2022
	gen	105
	mode	100555
	size	131080
	parent	77
	links	1
	pflags	40800000104
Indirect blocks:
               0 L0 0:12eb72000:12000 20000L/12000P F=1 B=105/105 cksum=16fe63abceef:30eb7f3af17191f:2407cd8cfd04b694:d5743be33ba51582

		segment [0000000000000000, 0000000000020000) size  128K

There's a single L0 block holding 128K of data, but the file size is 128K + 8 bytes. zdb doesn't indicate any embedded data, so it's not clear where these remaining 8 bytes come from, but it seems likely that we're simply mishandling this situation. When you copy the file away, you get the following instead, which the loader *can* grok:

obj=14 dataset=zroot/ROOT/default path=/boot type=20 bonustype=44
obj=115 dataset=zroot/ROOT/default path=/boot/modules type=20 bonustype=44
obj=65597 dataset=zroot/ROOT/default path=/boot/modules/geom_eli.ko type=19 bonustype=44

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
     65597    2   128K   128K    80K     512   256K   50.00  ZFS plain file (K=inherit) (Z=inherit=lz4)
                                               168   bonus  System attributes
	dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
	dnode maxblkid: 1
	uid     0
	gid     0
	atime	Wed Apr 20 17:37:15 2022
	mtime	Wed Apr 20 17:37:15 2022
	ctime	Wed Apr 20 17:37:15 2022
	crtime	Wed Apr 20 17:37:15 2022
	gen	887
	mode	100555
	size	131080
	parent	115
	links	1
	pflags	40800000104
Indirect blocks:
               0 L1  0:20481000:1000 20000L/1000P F=1 B=887/887 cksum=874e89542b:1f3270c7c16ce:39b480e80eca90d:75b552339f0fcd1b
               0  L0 0:a0143000:12000 20000L/12000P F=1 B=887/887 cksum=16fe63abceef:30eb7f3af17191f:2407cd8cfd04b694:d5743be33ba51582

		segment [0000000000000000, 0000000000020000) size  128K

I don't understand how everything is stored in the first scenario, but I'm adding some more ZFS people who can maybe elucidate.
Comment 11 Kyle Evans freebsd_committer freebsd_triage 2022-04-21 02:11:57 UTC
(In reply to Kyle Evans from comment #10)

Upon further inspection, the trailing 8 bytes are... all NUL. Zero-filling the remainder of the buffer when we walk past the last datablk fixes this, but I'm not sure whether there's a better indicator that the remainder of the file is just a hole.
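
For illustration, here is a minimal, self-contained C sketch of the approach described above: clip the read to the data blocks that actually exist and zero-fill the rest of the buffer, mirroring dmu_read_impl() (see the commits below for the real change). The toy_object/toy_read names and the in-memory layout are hypothetical; this is not the actual stand/libsa/zfs/zfsimpl.c patch.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Toy model of a file whose logical size extends a few bytes past its last
 * (and only) 128K data block, as in the zdb output in comment 10.
 */
struct toy_object {
	const uint8_t	*blocks;	/* contiguous data blocks */
	size_t		 blk_size;	/* data block size, e.g. 128K */
	uint64_t	 max_blkid;	/* id of the last block that exists */
	uint64_t	 logical_size;	/* file size; may exceed block data */
};

static int
toy_read(const struct toy_object *obj, uint64_t off, void *buf, size_t len)
{
	uint8_t *p = buf;

	if (off + len > obj->logical_size)
		return (-1);			/* read past EOF */

	while (len > 0) {
		uint64_t blkid = off / obj->blk_size;
		size_t boff = (size_t)(off % obj->blk_size);
		size_t n = obj->blk_size - boff;

		if (n > len)
			n = len;
		if (blkid > obj->max_blkid) {
			/* Past the last data block: the rest is a hole. */
			memset(p, 0, len);
			return (0);
		}
		memcpy(p, obj->blocks + blkid * obj->blk_size + boff, n);
		p += n;
		off += n;
		len -= n;
	}
	return (0);
}

int
main(void)
{
	size_t blk = 128 * 1024;
	uint8_t *data = malloc(blk);
	uint8_t *out = malloc(blk + 8);
	struct toy_object obj = {
		.blocks = data,
		.blk_size = blk,
		.max_blkid = 0,			/* only one data block exists */
		.logical_size = blk + 8		/* 131080, as in comment 10 */
	};

	if (data == NULL || out == NULL)
		return (1);
	memset(data, 0xAB, blk);
	assert(toy_read(&obj, 0, out, blk + 8) == 0);
	assert(out[blk] == 0 && out[blk + 7] == 0);	/* tail zero-filled */
	printf("read %zu bytes; trailing hole zero-filled\n", blk + 8);
	free(data);
	free(out);
	return (0);
}

Without the blkid > max_blkid branch, a reader like this has nothing to copy for the final 8 bytes and has to fail the read, which is essentially the behaviour the loader was exhibiting.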
Comment 12 Mason Loring Bliss freebsd_triage 2022-04-21 04:21:19 UTC
Tested kevans' test loader.efi on a VM (libvirt/QEMU/KVM, OVMF) and it
didn't choke: it booted normally. I will test again in the morning (local
time, ten hours from now) on the real hardware that failed with the RC3
loader.efi, and will note here how that goes.

Thanks!
Comment 13 freebsd 2022-04-21 06:08:51 UTC
(In reply to Mason Loring Bliss from comment #12)

I can report the same for -CURRENT of 20220407, which now boots without fuss (on the Framework laptop).
Comment 14 Mark Johnston freebsd_committer freebsd_triage 2022-04-21 14:26:13 UTC
*** Bug 263343 has been marked as a duplicate of this bug. ***
Comment 15 Mason Loring Bliss freebsd_triage 2022-04-21 16:17:28 UTC
I've tested this again on the hardware where I initially saw the failure, 
a ThinkPad T420, and Kyle's amended loader.efi exhibited no problems.
Comment 16 commit-hook freebsd_committer freebsd_triage 2022-04-21 19:58:08 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=914dc91d12198352b7878a88d30e2a6373a936e1

commit 914dc91d12198352b7878a88d30e2a6373a936e1
Author:     Kyle Evans <kevans@FreeBSD.org>
AuthorDate: 2022-04-21 19:57:24 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2022-04-21 19:57:24 +0000

    stand: zfs: handle holes at the tail end correctly

    This mirrors dmu_read_impl(), zeroing out the tail end of the buffer and
    clipping the read to what's contained by the block that exists.

    This fixes an issue that arose during the 13.1 release process; in
    13.1-RC1 and later, setting up GELI+ZFS will result in a failure to
    boot.  The culprit is this, which causes us to fail to load geom_eli.ko
    as there's a residual portion after the single datablk that should be
    zeroed out.

    PR:             263407
    Reviewed by:    tsoome
    MFC after:      3 days
    Differential Revision:  https://reviews.freebsd.org/D35019

 stand/libsa/zfs/zfsimpl.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
Comment 17 Mason Loring Bliss freebsd_triage 2022-04-21 22:42:44 UTC
Random note: Either the correction has made it into RC4 or the RC4 
install media behaves differently for some other reason, as I ran an 
RC4 installation just now that didn't fail. (Legacy, VM, installer 
doing the partitioning.)
Comment 18 Kyle Evans freebsd_committer freebsd_triage 2022-04-21 22:50:34 UTC
(In reply to Mason Loring Bliss from comment #17)

It's the kind of bug that's incredibly hit and miss: -rc4 could be fine, but then the stars re-align and we hit it again in the final release. IMO it's better to play it safe here.
Comment 19 Mason Loring Bliss freebsd_triage 2022-04-21 22:55:42 UTC
I'm just thinking that it would be convenient to have a way to induce the
failure scenario explicitly, so that we're unambiguously testing it rather
than just failing to encounter the error because the preconditions don't
exist. (I don't at present understand the issue well enough to have any
solid idea of how to go about this.)
Comment 20 Glen Barber freebsd_committer freebsd_triage 2022-04-22 00:42:31 UTC
(In reply to Mason Loring Bliss from comment #17)
I cannot think of anything that would cause this to "just work" between RC3 and RC4.

Hopefully, however, the fix Kyle has in place for -CURRENT will address the issue in the inevitable RC5 build (already added to the schedule).
Comment 21 Mason Loring Bliss freebsd_triage 2022-04-22 00:58:13 UTC
(In reply to Glen Barber from comment #20)

It seems like it should be possible to use truncate/ftruncate/
posix_fallocate or something to recreate the hole for testing, but my
current hurdle is zdb balking with:

zdb failed to hold dataset 'tank/ROOT/default': No such file or directory

I'd love to see before-and-after data like what Kyle noted in comment 10,
but I'm flailing a bit. I'd be grateful for pointers. (I've tried
making the cache file explicit to no avail.)
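
For illustration, here is a minimal C sketch of one way to create such a file, along the lines suggested above: write exactly one full record and then ftruncate(2) the file a few bytes past it, leaving the tail as a hole. The file name and sizes are placeholders, and whether ZFS ends up with the same on-disk layout as the problematic geom_eli.ko (maxblkid 0, logical size 8 bytes past the only block) would need to be confirmed with zdb.

#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const char *path = "holetest.bin";	/* placeholder name */
	const size_t recordsize = 128 * 1024;	/* assumes 128K recordsize */
	char *buf;
	int fd;

	if ((buf = malloc(recordsize)) == NULL)
		err(1, "malloc");
	memset(buf, 'A', recordsize);

	if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) == -1)
		err(1, "open");
	if (write(fd, buf, recordsize) != (ssize_t)recordsize)
		err(1, "write");
	/* Extend the file 8 bytes past the written block; the tail is a hole. */
	if (ftruncate(fd, (off_t)(recordsize + 8)) == -1)
		err(1, "ftruncate");
	close(fd);
	free(buf);
	return (0);
}

To exercise the loader itself, the resulting file would presumably have to be a kernel module on the boot dataset that the loader actually tries to load, so treat this purely as a starting point for producing before-and-after zdb data.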
Comment 22 Graham Perrin freebsd_committer freebsd_triage 2022-04-22 06:48:55 UTC
Thanks, everyone. 

(In reply to Kyle Evans from comment #18)

> It's the kind of bug that's incredibly hit and miss …

In retrospect, regarding the second beta: is it reasonable to treat Pierre Beyssac's observation as an early 'hit'? 

<https://lists.freebsd.org/archives/freebsd-stable/2022-March/000623.html> 

If so: more easily reproducible with RC1 than with BETA2, yes?
Comment 23 Mason Loring Bliss freebsd_triage 2022-04-23 21:33:19 UTC
Random note: I used an identical zdb invocation on a Debian box, and it
worked fine. It also works fine on a FreeBSD 12 box I just tried, so it's
only failing on my 13.1 installs. For completeness, it also works fine on a
13.0 fileserver here.
Comment 24 Mason Loring Bliss freebsd_triage 2022-04-23 22:21:37 UTC
Added https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263501 for the
zdb issue.
Comment 25 commit-hook freebsd_committer freebsd_triage 2022-04-24 13:30:10 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=0c9c8a4c6459a8cd1b7290ae7dcd9452d3a4dbc1

commit 0c9c8a4c6459a8cd1b7290ae7dcd9452d3a4dbc1
Author:     Kyle Evans <kevans@FreeBSD.org>
AuthorDate: 2022-04-21 19:57:24 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2022-04-24 03:51:38 +0000

    stand: zfs: handle holes at the tail end correctly

    This mirrors dmu_read_impl(), zeroing out the tail end of the buffer and
    clipping the read to what's contained by the block that exists.

    This fixes an issue that arose during the 13.1 release process; in
    13.1-RC1 and later, setting up GELI+ZFS will result in a failure to
    boot.  The culprit is this, which causes us to fail to load geom_eli.ko
    as there's a residual portion after the single datablk that should be
    zeroed out.

    PR:             263407
    Reviewed by:    tsoome

    (cherry picked from commit 914dc91d12198352b7878a88d30e2a6373a936e1)

 stand/libsa/zfs/zfsimpl.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
Comment 26 commit-hook freebsd_committer freebsd_triage 2022-04-24 13:30:12 UTC
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9a1efa766781c9dd1eb5975ce3a5c8a64d89cd18

commit 9a1efa766781c9dd1eb5975ce3a5c8a64d89cd18
Author:     Kyle Evans <kevans@FreeBSD.org>
AuthorDate: 2022-04-21 19:57:24 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2022-04-24 03:49:22 +0000

    stand: zfs: handle holes at the tail end correctly

    This mirrors dmu_read_impl(), zeroing out the tail end of the buffer and
    clipping the read to what's contained by the block that exists.

    This fixes an issue that arose during the 13.1 release process; in
    13.1-RC1 and later, setting up GELI+ZFS will result in a failure to
    boot.  The culprit is this, which causes us to fail to load geom_eli.ko
    as there's a residual portion after the single datablk that should be
    zeroed out.

    PR:             263407
    Reviewed by:    tsoome

    (cherry picked from commit 914dc91d12198352b7878a88d30e2a6373a936e1)

 stand/libsa/zfs/zfsimpl.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
Comment 27 Kyle Evans freebsd_committer freebsd_triage 2022-04-24 13:34:53 UTC
I've sent this off for formal re@ approval; this will likely get pushed to releng/13.1 tonight or tomorrow morning.
Comment 28 commit-hook freebsd_committer freebsd_triage 2022-04-24 23:16:02 UTC
A commit in branch releng/13.1 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=59288c719dc2af9b59e33a88f9b138c5bac38335

commit 59288c719dc2af9b59e33a88f9b138c5bac38335
Author:     Kyle Evans <kevans@FreeBSD.org>
AuthorDate: 2022-04-21 19:57:24 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2022-04-24 23:15:07 +0000

    stand: zfs: handle holes at the tail end correctly

    This mirrors dmu_read_impl(), zeroing out the tail end of the buffer and
    clipping the read to what's contained by the block that exists.

    This fixes an issue that arose during the 13.1 release process; in
    13.1-RC1 and later, setting up GELI+ZFS will result in a failure to
    boot.  The culprit is this, which causes us to fail to load geom_eli.ko
    as there's a residual portion after the single datablk that should be
    zeroed out.

    PR:             263407
    Reviewed by:    tsoome
    Approved by:    re (gjb)

    (cherry picked from commit 914dc91d12198352b7878a88d30e2a6373a936e1)
    (cherry picked from commit 0c9c8a4c6459a8cd1b7290ae7dcd9452d3a4dbc1)

 stand/libsa/zfs/zfsimpl.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
Comment 29 Kyle Evans freebsd_committer freebsd_triage 2022-04-24 23:16:37 UTC
Thank you for flying FreeBSD Airlines, have a nice release.
Comment 30 Mason Loring Bliss freebsd_triage 2022-04-28 21:45:11 UTC
Just a quick note: RC5 boots correctly and zdb provides data.