Hello, in 13.1-RELEASE and -CURRENT as of 18054d0220cf it's impossible to boot from an GELI encrypted root volume, because the 'currdev' loader variable is set to a wrong string. Thus the loader is unable to load the boot data from the disk. I've bisect this in 13.1-RELEASE. It was broken in commit bc9154a208248, which was cherry picked from b4cb3fe0e39a. Setup ----- I've tested BIOS boot from GPT with / on GELI. / is decrypted by gptboot. The bootchain is BIOS -> pmbr (read from the MBR) -> gptboot (read from a freebsd-boot partition) -> /boot/loader. geli show: => 40 41942960 vtbd0 GPT (20G) 40 256 1 freebsd-boot (128K) 296 4194304 2 freebsd-swap (2.0G) 4194600 37748400 3 freebsd-ufs (18G) It doesn't matter if it's real hardware or - like in this example - an VM. It happens regardless how many devices are attached. Problem ------- Try to boot the system. /boot/loader errors out with "ERROR: cannot open /boot/lua/loader.lua: no such file or directory." This is caused by the currdev variable get set to the wrong string: # show currdev gelidisk0p3: The 'geli' at the beginning of the string is wrong. lsdev lists the device with its correct name, disk0p3. loaders build before bc9154a208248 are working fine, currdev is set to disk0p3. Impact ------ This makes it impossible to boot from an encrypted /. At least not without manual interactions, like typing the correct path into the loader prompt. I've testes only BIOS with GPT and UFS. I don't know if other combinations are also impacted.
I confirm this is a problem with the mentioned encrypted setup - it bricked one of my installs. The dirty workaround might be to replace the /boot/loader with the one from the 13.0-RELEASE, that is, the last version when this disk encryption setup seems to have worked. Great job on tracking the bug.
I had a deeper look. This is a long standing bug, it's been there since forever. bc9154a208248 exposed it and broke the BIOS bootloader on UFS on GELI encrypted devices. ZFS is not broken, because it uses another code path. The bootloader tracks devices in `devsw` structs. These structs have a field `dv_name`, storing the name of the device type. For everything which qualifies as a harddisk the string is just "disk". The BIOS loader (i386 loader in the code) derives the actual devicename from `dv_name` and some unit number magic, drive C: ends up as unit 0. That happens in `i386_fmtdev()`. GELI employs a hack: GELI devices are treated like disks, the GELI initialisation code hijacks the `devsw` struct of the given disk and overwrites it with a specialized version. The hack is explained in stand/libsa/geli/gelidev.c:46. All lines are against 14-CURRENT as of 1326017849ee. This specialized version sets `dv_name` to "gelidisk". That's were the wrong string comes from. Until bc9154a208248 the `currdev` variable got set before GELI was initalized. `dv_name` was still set to "disk". But with bc9154a208248 `currdev` is set after GELI was initialized and `dv_name` is already "gelidisk". The code doesn't take that into account and we end up with a wrong string. I don't know enough about the bootloader to decide what the best fix is. We could add logic to handle "gelidisk", since GELI disks are just disks that looks overly complicated and error prone. We can introduce a new device type `DEVT_GELI` or something like that. That would require a lot of code handling that type, most of it would be a copy of the generic disk code. The easiest way is to change "gelidisk" to just "disk". That should be okay, because there's no code which handles "gelidisk" and GELI disks are disks. The attached patch does exactly that. A oneliner, tested on my VM. I'm also attaching a prebuild loader with the patch applied. It can be used to recover bricked system, just replace /boot/loader with it.
Created attachment 234418 [details] Remove special dv_name for GELI disks, handle them like normal disks.
Created attachment 234419 [details] Prebuild /boot/loader with the patch applied, can be used to recover bricked systems
(In reply to yamagi from comment #3) Hi yamagi, Thanks for the amazing effort. I have tested your patch on three main-n256001-f1294737023 systems; they all booted expected and without intervention past the GELI passphrase prompt. Thanks.
Does this only impact UFS?
It does NOT impact ZFS. I haven't looked into the details. I thing it has something to do with ZFS ending up with a currdev like 'zfs:system/ROOT/default:'. Since the device type isn't 'disk', the disk <-> gelidisk problem doesn't matter. Or something like that. That ZFS isn't affected is properly the reason why this wasn't noticed before. Most users boot from ZFS these days. I haven't tested cd9660, ext2fs and msdosfs. From reading the source I would guess that they're affected, because they're using more or less the same path as UFS.
Over to tsoome@ as committer of the identified commit... feel free to reassign.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=e417249016efcca73c9edad21b94b1315bc44601 commit e417249016efcca73c9edad21b94b1315bc44601 Author: Toomas Soome <tsoome@FreeBSD.org> AuthorDate: 2022-06-20 06:51:44 +0000 Commit: Toomas Soome <tsoome@FreeBSD.org> CommitDate: 2022-06-20 07:10:14 +0000 loader: GELI encrypted disk should still use device name disk geli_probe_and_attach() does pick geli_devsw structure for encrypted disks, the implementation depends on device name "disk" when device type is DEVT_DISK, but geli_devsw is setting name field "gelidisk". PR: 264282 Submitted by: yamagi@yamagi.org Reported by: yamagi@yamagi.org MFC after: 2 weeks stand/libsa/geli/gelidev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=67f8b6d985dbc85682d6f8b2dd5117068c19ca59 commit 67f8b6d985dbc85682d6f8b2dd5117068c19ca59 Author: Toomas Soome <tsoome@FreeBSD.org> AuthorDate: 2022-06-20 06:51:44 +0000 Commit: Toomas Soome <tsoome@FreeBSD.org> CommitDate: 2022-07-04 06:14:43 +0000 loader: GELI encrypted disk should still use device name disk geli_probe_and_attach() does pick geli_devsw structure for encrypted disks, the implementation depends on device name "disk" when device type is DEVT_DISK, but geli_devsw is setting name field "gelidisk". PR: 264282 Submitted by: yamagi@yamagi.org Reported by: yamagi@yamagi.org (cherry picked from commit e417249016efcca73c9edad21b94b1315bc44601) stand/libsa/geli/gelidev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
<http://markmail.org/message/zwnm4lgzbvbtucei> in the midst of <https://lists.freebsd.org/archives/freebsd-hackers/2022-December/001714.html> discussion about a possible patch for 12.4-RELEASE.
^Triage: MFC to 12 now Overcome By Events.