| Summary: | RC3 Guided ZFS on root with encryption unmountable |
|---|---|
| Product: | Base System |
| Reporter: | Daniel Briggs <bsdbugs> |
| Component: | misc |
| Assignee: | Kyle Evans <kevans> |
| Status: | Closed FIXED |
| Severity: | Affects Some People |
| CC: | allanjude, bugs, elcabra, emaste, freebsd, freqlabs, grahamperrin, kevans, mason, mav, mm, pb, re |
| Priority: | --- |
| Flags: | kevans: mfc-stable13+, kevans: mfc-stable12+ |
| Version: | 13.1-STABLE |
| Hardware: | amd64 |
| OS: | Any |
| URL: | https://reviews.freebsd.org/D35019 |
| Attachments: | 233334 (photograph), 233335 (frame from a recording) |
Description
Daniel Briggs
2022-04-18 22:55:15 UTC
Maybe related: main-n254654-d4e8207317c results in "no pools available to import" <https://lists.freebsd.org/archives/freebsd-current/2022-April/001754.html> <https://markmail.org/message/xua4rc3qtprkqiwu> and (threading broken) <https://markmail.org/message/77js65vrngjgumco>

Created attachment 233334 [details]
photograph

(In reply to Graham Perrin from comment #1)

… on second thoughts, maybe _not_ related, although note the comment from Tom (Thomas) Laus at <https://lists.freebsd.org/archives/freebsd-current/2022-April/001824.html>.

(In reply to Daniel Briggs from comment #0)

Reproducible with FreeBSD-13.1-RC3-amd64-memstick.img on an HP ProBook 450 G8. Reproducible with FreeBSD-13.1-RC3-amd64-disc1.iso on the same computer, photograph attached.

Created attachment 233335 [details]
frame from a recording

(In reply to Graham Perrin from comment #2)

Copy typed from a frame from a recording (apologies for any typos):

    Loading kernel...
    …
    Loading configured modules...
    /boot/kernel/geom_eli.ko
    elf64_obj_loadimage: read section headers failed
    can't load file '/boot/kernel/geom_eli.ko': input/output error
    /boot/kernel/cryptodev.ko size 0xa158 at 0x2131000
    can't find '/etc/hostid'
    /boot/kernel/zfs.ko size 0x5b93a0 at 0x213c000
    /boot/entropy size=0x1000
    staging 0x2b000000 (not copying)
    tramp 0x33b3d000 PT4 0x339ef000
    Start @ 0xffffffff80385000 ...
    EFI framebuffer information:
    …

I am not using any GELI partitions; I got the issue on 14-CURRENT of 20220407 (latest commit 0c348b97eb0) on a Framework laptop (c. 2022-02; Intel i5-1135G7; NVMe disk; UEFI only). There was/is no issue with "gpart show", and no issue importing and exporting the root pool in "Live CD" mode; I went back to an earlier, working boot environment of 20220402 (main-n254246-6833ac673d9).

As a datapoint, I'd noticed this recently, in my case using my "ZFS and GELI by hand" guide as the install method.
On reading this bug (thanks parv) I ran a test install on a VM (libvirt), letting the installer partition for me, and the issue was there. I confirmed that it's also there in RC1, and then performed some more comprehensive testing. I fished out all the ISOs I have locally, as the older ones don't seem to be available from freebsd.org any more, at least where I looked. I ran most tests on a VM (the legacy results) and on a couple of sorts of hardware (both UEFI results). Tests run so far:

- STABLE-20211111: UEFI and legacy, boots with and without GELI
- BETA1: UEFI and legacy, boots with and without GELI
- BETA2: legacy, boots with and without GELI
- BETA3: legacy, boots with and without GELI
- RC1: legacy, GELI fails to boot
- RC2: legacy, GELI fails to boot
- RC3: legacy, GELI fails to boot

So, this issue seems to have emerged between BETA3 and RC1.

Martin, is it possible this was introduced with 99c3f15774595c1163ecec87aa8cb157ccdc7d96? That is the only ZFS-related change between BETA3 and RC1.

I have started debugging, and here is my finding: there are no errors with 13.1-BETA3, but this doesn't mean it is "fine". When doing a 13.1-RC3 install with ZFS+GELI, the loader fails to load /boot/kernel/geom_eli.ko with the following error message:

    elf64_obj_loadimage: read section headers failed

Because geom_eli.ko is not loaded, ZFS is of course unable to find the "zroot" pool when trying to mount root: the geli device does not exist. The file has the correct checksum, but the loader can't cope with it, for an unknown reason. Other modules can be loaded.

I have found this workaround. After running the automated installer, exit to a shell (before rebooting) and do the following:

    cp /boot/kernel/geom_eli.ko /tmp
    rm /boot/kernel/geom_eli.ko
    cp /tmp/geom_eli.ko /boot/kernel/geom_eli.ko

Then reboot, and the system boots fine. Is this a problem in the ZFS support of our loader (not being able to read a specific file out of ZFS), or elsewhere?
In BETA3, geom_eli.ko might simply be located at a different position in the root pool, which would be why the loader reads it correctly there.

Martin, just to confirm: kevans had me capture some output, and what I see matches (spinner and carriage-return characters stripped):

    Loading configured modules...
    /boot/kernel/zfs.ko size 0x5b93a0 at 0x2131000
    /boot/kernel/geom_eli.ko
    elf64_obj_loadimage: read section headers failed

    /boot/kernel/geom_eli.ko
    elf64_obj_loadimage: read section headers failed

    can't load file '/boot/kernel/geom_eli.ko': input/output error
    /boot/entropy size=0x1000
    /boot/kernel/cryptodev.ko size 0xa158 at 0x26ec000
    can't find '/etc/hostid'
    ---<<BOOT>>---

And further confirmation: your workaround works, and previously-unbootable systems boot. Bizarre stuff.

(In reply to Martin Matuska from comment #7)

To summarize some of what I've discovered: the loader fails to read the file because we process the first datablk and then still have 8 bytes left to read. Note the zdb output:

    obj=14 dataset=zroot/ROOT/default path=/boot type=20 bonustype=44
    obj=77 dataset=zroot/ROOT/default path=/boot/kernel type=20 bonustype=44
    obj=16532 dataset=zroot/ROOT/default path=/boot/kernel/geom_eli.ko type=19 bonustype=44

        Object  lvl  iblk  dblk  dsize  dnsize  lsize   %full  type
         16532    1  128K  128K   72K     512   128K   100.00  ZFS plain file (K=inherit) (Z=inherit=lz4)
                                         168    bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
        dnode maxblkid: 0
        uid     0
        gid     0
        atime   Wed Apr 20 16:36:09 2022
        mtime   Thu Apr 14 05:24:57 2022
        ctime   Wed Apr 20 16:36:13 2022
        crtime  Thu Apr 14 05:24:57 2022
        gen     105
        mode    100555
        size    131080
        parent  77
        links   1
        pflags  40800000104
        Indirect blocks:
             0 L0 0:12eb72000:12000 20000L/12000P F=1 B=105/105 cksum=16fe63abceef:30eb7f3af17191f:2407cd8cfd04b694:d5743be33ba51582

                segment [0000000000000000, 0000000000020000) size  128K

There is a single block holding 128K of data, but the file size is 128K + 8 bytes.
zdb doesn't indicate any embedded data, so it's not clear where the remaining 8 bytes come from, but it seems likely that we're just mishandling this situation. When you copy the file away, you get this instead, which the loader *can* grok:

    obj=14 dataset=zroot/ROOT/default path=/boot type=20 bonustype=44
    obj=115 dataset=zroot/ROOT/default path=/boot/modules type=20 bonustype=44
    obj=65597 dataset=zroot/ROOT/default path=/boot/modules/geom_eli.ko type=19 bonustype=44

        Object  lvl  iblk  dblk  dsize  dnsize  lsize   %full  type
         65597    2  128K  128K   80K     512   256K    50.00  ZFS plain file (K=inherit) (Z=inherit=lz4)
                                         168    bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
        dnode maxblkid: 1
        uid     0
        gid     0
        atime   Wed Apr 20 17:37:15 2022
        mtime   Wed Apr 20 17:37:15 2022
        ctime   Wed Apr 20 17:37:15 2022
        crtime  Wed Apr 20 17:37:15 2022
        gen     887
        mode    100555
        size    131080
        parent  115
        links   1
        pflags  40800000104
        Indirect blocks:
             0 L1  0:20481000:1000 20000L/1000P F=1 B=887/887 cksum=874e89542b:1f3270c7c16ce:39b480e80eca90d:75b552339f0fcd1b
             0 L0  0:a0143000:12000 20000L/12000P F=1 B=887/887 cksum=16fe63abceef:30eb7f3af17191f:2407cd8cfd04b694:d5743be33ba51582

                segment [0000000000000000, 0000000000020000) size  128K

I don't understand how everything's stored in the first scenario, but I'm adding some more ZFS people who can maybe elucidate.

(In reply to Kyle Evans from comment #10)

Upon further inspection, the trailing 8 bytes are... all NUL. Zero-filling the remainder of the buffer when we walk past the last datablk fixes this, but I'm not sure if there's a better indicator that the remainder of the file is just a hole.

Tested kevans' test loader.efi on a VM (libvirt/QEMU/KVM, OVMF) and it didn't choke; it booted normally. Will test again on real hardware that failed with the RC3 loader.efi in the (local, 10 hours from now) morning, and note here how that goes. Thanks!
(In reply to Mason Loring Bliss from comment #12)

The same holds for -CURRENT of 20220407, which now boots without fuss (on the Framework laptop).

*** Bug 263343 has been marked as a duplicate of this bug. ***

I've tested this again on the hardware where I initially saw the failure, a ThinkPad T420, and Kyle's amended loader.efi exhibited no problems.

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=914dc91d12198352b7878a88d30e2a6373a936e1

    commit 914dc91d12198352b7878a88d30e2a6373a936e1
    Author:     Kyle Evans <kevans@FreeBSD.org>
    AuthorDate: 2022-04-21 19:57:24 +0000
    Commit:     Kyle Evans <kevans@FreeBSD.org>
    CommitDate: 2022-04-21 19:57:24 +0000

        stand: zfs: handle holes at the tail end correctly

        This mirrors dmu_read_impl(), zeroing out the tail end of the buffer
        and clipping the read to what's contained by the block that exists.

        This fixes an issue that arose during the 13.1 release process; in
        13.1-RC1 and later, setting up GELI+ZFS will result in a failure to
        boot.  The culprit is this, which causes us to fail to load
        geom_eli.ko as there's a residual portion after the single datablk
        that should be zeroed out.

        PR:             263407
        Reviewed by:    tsoome
        MFC after:      3 days
        Differential Revision: https://reviews.freebsd.org/D35019

     stand/libsa/zfs/zfsimpl.c | 13 +++++++++++++
     1 file changed, 13 insertions(+)

Random note: either the correction has made it into RC4, or the RC4 install media behave differently for some other reason, as I ran an RC4 installation just now that didn't fail. (Legacy, VM, installer doing the partitioning.)

(In reply to Mason Loring Bliss from comment #17)

It's the kind of bug that's incredibly hit-and-miss: RC4 could be fine, but then the stars re-align and we hit it again in the final release. IMO it's better to play it safe here.
I'm just thinking that it would be convenient to have a way to induce the failure scenario explicitly, so that we're unambiguously testing it rather than just failing to encounter the error because the preconditions don't exist. (I don't at present understand the issue well enough to have any solid idea of how to go about this.)

(In reply to Mason Loring Bliss from comment #17)

I cannot think of anything that would cause this to "just work" between RC3 and RC4. Hopefully, however, the fix Kyle has in place for -CURRENT will address the issue in the inevitable RC5 build (already added to the schedule).

(In reply to Glen Barber from comment #20)

It seems like it should be possible to use truncate/ftruncate/posix_fallocate or something to recreate the hole, for testing, but my current hurdle is zdb balking with:

    zdb failed to hold dataset 'tank/ROOT/default': No such file or directory

I'd love to see before-and-after data like what Kyle noted in comment 10, but I'm flailing a bit and would be grateful for pointers. (I've tried making the cache file explicit, to no avail.) Thanks, everyone.

(In reply to Kyle Evans from comment #18)

> It's the kind of bug that's incredibly hit and miss …

Retrospective, the second beta: is it reasonable to treat Pierre Beyssac's observation as an early 'hit'? <https://lists.freebsd.org/archives/freebsd-stable/2022-March/000623.html> If so: more easily reproducible with RC1 than with BETA2, yes?

Random note: I used an identical zdb invocation on a Debian box, and it worked fine. It also works fine on a FreeBSD 12 box I just tried, so it's only failing on my 13.1 installs. And for completeness, it also works fine on a 13.0 fileserver here. I've added https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263501 for the zdb issue.
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=0c9c8a4c6459a8cd1b7290ae7dcd9452d3a4dbc1

    commit 0c9c8a4c6459a8cd1b7290ae7dcd9452d3a4dbc1
    Author:     Kyle Evans <kevans@FreeBSD.org>
    AuthorDate: 2022-04-21 19:57:24 +0000
    Commit:     Kyle Evans <kevans@FreeBSD.org>
    CommitDate: 2022-04-24 03:51:38 +0000

        stand: zfs: handle holes at the tail end correctly

        This mirrors dmu_read_impl(), zeroing out the tail end of the buffer
        and clipping the read to what's contained by the block that exists.

        This fixes an issue that arose during the 13.1 release process; in
        13.1-RC1 and later, setting up GELI+ZFS will result in a failure to
        boot.  The culprit is this, which causes us to fail to load
        geom_eli.ko as there's a residual portion after the single datablk
        that should be zeroed out.

        PR:             263407
        Reviewed by:    tsoome

        (cherry picked from commit 914dc91d12198352b7878a88d30e2a6373a936e1)

     stand/libsa/zfs/zfsimpl.c | 13 +++++++++++++
     1 file changed, 13 insertions(+)

A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9a1efa766781c9dd1eb5975ce3a5c8a64d89cd18

    commit 9a1efa766781c9dd1eb5975ce3a5c8a64d89cd18
    Author:     Kyle Evans <kevans@FreeBSD.org>
    AuthorDate: 2022-04-21 19:57:24 +0000
    Commit:     Kyle Evans <kevans@FreeBSD.org>
    CommitDate: 2022-04-24 03:49:22 +0000

        stand: zfs: handle holes at the tail end correctly

        This mirrors dmu_read_impl(), zeroing out the tail end of the buffer
        and clipping the read to what's contained by the block that exists.

        This fixes an issue that arose during the 13.1 release process; in
        13.1-RC1 and later, setting up GELI+ZFS will result in a failure to
        boot.  The culprit is this, which causes us to fail to load
        geom_eli.ko as there's a residual portion after the single datablk
        that should be zeroed out.

        PR:             263407
        Reviewed by:    tsoome

        (cherry picked from commit 914dc91d12198352b7878a88d30e2a6373a936e1)

     stand/libsa/zfs/zfsimpl.c | 13 +++++++++++++
     1 file changed, 13 insertions(+)

I've sent this off for formal re@ approval; this will likely get pushed to releng/13.1 tonight or tomorrow morning.

A commit in branch releng/13.1 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=59288c719dc2af9b59e33a88f9b138c5bac38335

    commit 59288c719dc2af9b59e33a88f9b138c5bac38335
    Author:     Kyle Evans <kevans@FreeBSD.org>
    AuthorDate: 2022-04-21 19:57:24 +0000
    Commit:     Kyle Evans <kevans@FreeBSD.org>
    CommitDate: 2022-04-24 23:15:07 +0000

        stand: zfs: handle holes at the tail end correctly

        This mirrors dmu_read_impl(), zeroing out the tail end of the buffer
        and clipping the read to what's contained by the block that exists.

        This fixes an issue that arose during the 13.1 release process; in
        13.1-RC1 and later, setting up GELI+ZFS will result in a failure to
        boot.  The culprit is this, which causes us to fail to load
        geom_eli.ko as there's a residual portion after the single datablk
        that should be zeroed out.

        PR:             263407
        Reviewed by:    tsoome
        Approved by:    re (gjb)

        (cherry picked from commit 914dc91d12198352b7878a88d30e2a6373a936e1)
        (cherry picked from commit 0c9c8a4c6459a8cd1b7290ae7dcd9452d3a4dbc1)

     stand/libsa/zfs/zfsimpl.c | 13 +++++++++++++
     1 file changed, 13 insertions(+)

Thank you for flying FreeBSD Airlines, have a nice release.

Just a quick note: RC5 boots correctly, and zdb provides data.