Bug 236981 - loader_lua does not boot from encrypted ZFS pool after upgrading from r345243 to r345759
Summary: loader_lua does not boot from encrypted ZFS pool after upgrading from r34524...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: amd64 Any
: Normal Affects Some People
Assignee: Ian Lepore
URL:
Keywords: needs-qa, regression
Depends on:
Blocks:
 
Reported: 2019-04-03 08:57 UTC by rz-rpi03
Modified: 2019-10-24 03:53 UTC (History)
4 users (show)

See Also:
koobs: mfc-stable12?


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description rz-rpi03 2019-04-03 08:57:16 UTC
The base r345759 boot failes because it can not find the encrypted ZFS pool any more.

The console looks like:

 BTX loader 1.00 BTX version 1.02
 Consoles: internal video/keyboard
 BIOS drive C: is disk0
 BIOS drive D: is disk1
 BIOS drive E: is disk2
 BIOS 630kb/30... available memory

 FreeBSD/x86 bootstrap loader, Revision 1.1
 ZFS: can't find pool by guid
 ZFS: can't find pool by guid
 ZFS: can't find pool by guid
 Startup error in /boot/lua/loader.lua:
 LUA ERROR: cannot open /boot/lua/loader.lua: invalid argument.

 can't load 'kernel'
 Type '?' for a list of commands...

I managed to get the machine back by
- booting from FreeBSD-13.0-CURRENT-amd64-20190321-r345355-memstick.img
- attaching the partitions containing the ZFS boot pool with "geli attach"
- using "zpool import -f" to import the pool and
- replacing the base r345759 loader (linked with loader_lua) with the base r345243 loader_lua.
So the machine is currently bootstraping with base r345243 loader_lua and
running the base r345759 kernel and world.

Searching the commit logs between base r345243 and base r345759 leads me to suspect
base r345330 "loader: fix loading of kernels with . in path", but I did not had
the time to just reverse base r345330 and give the resulting loader_lua a try.

Ralf
Comment 1 Kubilay Kocak freebsd_committer freebsd_triage 2019-04-03 09:01:15 UTC
CC committer of base r345330
Comment 2 Ed Maste freebsd_committer freebsd_triage 2019-04-03 13:02:53 UTC
> ZFS: can't find pool by guid
suggests to me that something's going wrong before we get to the point where the r345330 change takes effect; it would be great if you can try testing with just base r345330 reverted.

I've CC'd the author of the base r345330 patch.
Comment 3 rz-rpi03 2019-04-03 21:19:57 UTC
(In reply to Ed Maste from comment #2)

Ed is right, I was wrong.

Just reverting base r345330 results in the same messages when trying to boot.

After buildung the new loader_lua (without base r345330) I noticed that the
resulting file has excatly the same size as the one with r345330 but I gave
it a try.

The working base r345243 loader_lua is around 32K bigger than the failing
ones.

$ ls -li loader*
306844 -r-xr-xr-x  3 root  wheel  430080 Apr  3 22:31 loader
 65536 -r-xr-xr-x  2 root  wheel  397312 Apr  1 13:32 loader.20190401
492114 -r-xr-xr-x  2 root  wheel  397312 Apr  3 22:30 loader.20190403
306762 -r--r--r--  1 root  wheel    7418 Apr  1 13:32 loader.4th
  4987 -rw-r--r--  1 root  wheel     888 Oct 15  2017 loader.conf
306855 -r-xr-xr-x  2 root  wheel  475648 Apr  1 13:32 loader.efi
611509 -r--r--r--  1 root  wheel   15084 Aug 12  2018 loader.help
 49530 -r-xr-xr-x  1 root  wheel  434176 Aug 11  2018 loader.old
306779 -r--r--r--  1 root  wheel     461 Apr  1 13:32 loader.rc
306843 -r-xr-xr-x  1 root  wheel  348160 Apr  1 13:32 loader_4th
306854 -r-xr-xr-x  1 root  wheel  410112 Apr  1 13:32 loader_4th.efi
352728 -r-xr-xr-x  1 root  wheel  380928 Mar 17 16:28 loader_4th.old
306844 -r-xr-xr-x  3 root  wheel  430080 Apr  3 22:31 loader_lua
 65580 -r-xr-xr-x  1 root  wheel  430080 Mar 17 16:28 loader_lua.20190317
 65536 -r-xr-xr-x  2 root  wheel  397312 Apr  1 13:32 loader_lua.20190401
492114 -r-xr-xr-x  2 root  wheel  397312 Apr  3 22:30 loader_lua.20190403
306855 -r-xr-xr-x  2 root  wheel  475648 Apr  1 13:32 loader_lua.efi
352729 -r-xr-xr-x  1 root  wheel  430080 Mar 17 16:28 loader_lua.old
306845 -r-xr-xr-x  1 root  wheel  299008 Apr  1 13:32 loader_simp
306856 -r-xr-xr-x  1 root  wheel  350720 Apr  1 13:32 loader_simp.efi
352730 -r-xr-xr-x  1 root  wheel  319488 Mar 17 16:28 loader_simp.old
$ 

The .20190317 file is from r345243, .20190401 is at r345759 and .20190403
is r345759 build without r345330.
Comment 4 rz-rpi03 2019-04-05 20:02:33 UTC
(In reply to iz-rpi03 from comment #3)
The size of loader_lua does not seem to be related to the ability to find
the ZFS pool. All newly build loader_lua files have the smaller size and some
of them are working as expected.

I am currently trying to find the release when the failure happend
by building release versions of loader_lua.
For this I bisect the (head) releases in between and trying out the resulting
loader_lua.

From the running r345759 world I try to achive this with the following
commands, e.g.:

rel=345475
svnlite up -r ${rel} .
make -j8 buildworld
cp /usr/obj/usr/src/amd64.amd64/stand/i386/loader_lua/loader_lua \
   /boot/loader_lua.r${rel}
cp /boot/loader_lua.r${rel} /boot/loader_lua

I hope this is not a silly idea.
Comment 5 rz-rpi03 2019-04-07 22:43:24 UTC
(In reply to iz-rpi03 from comment #4)
Finally I found the change which causes the issue.
Verified by switching between with the working and non working relase several times.

loader_lua from base r345476 is working as expected.
loader_lua from base r345477 does not find the pool.

The SHA256 changes from loader_lua.r345476 = deb308b2f112101fd6f6fd882d05e70540b0413d9cf3049762e1c12f8ea6e106
to loader_lua.r345477 = fbd759e2e00e39d4176717c88fd9da830783477662fe0975ea417fd0943ce695
and remains constant till base r345763 at least.
Comment 6 Ed Maste freebsd_committer freebsd_triage 2019-04-08 12:51:13 UTC
Review for identified commit: https://reviews.freebsd.org/D19262
Comment 7 commit-hook freebsd_committer freebsd_triage 2019-04-25 15:09:26 UTC
A commit references this bug:

Author: ian
Date: Thu Apr 25 15:09:21 UTC 2019
New revision: 346675
URL: https://svnweb.freebsd.org/changeset/base/346675

Log:
  Restore the ability to open a raw disk or partition in loader(8).

  The disk_open() function searches for "the best partition" when slice and
  partition information is not provided as part of the device name.  As of
  r345477 the slice and partition fields of a disk_devdesc are initialized to
  D_SLICEWILD and D_PARTWILD; in the past they were initialized to -1, which
  was sometimes interpreted as meaning 'wildcard' and sometimes as 'open the
  raw partition' depending on the context.  So as an unintended side effect of
  r345477 it became basically impossible to ever open a disk or partition
  without doing the 'best partition' search.  One visible effect of that was
  the inability to open the raw disk to read the partition table correctly in
  zfs_probe_dev(), leading to failures to find the zfs pool unless it was on
  the first partition.

  Now instead of always initializing slice and partition to wildcards, the
  disk_parsedev() function initializes them based on the presence of a
  path/file name following the device.  If there is any path or filename
  following the ':' that ends the device name, then slice and partition are
  initialized to D_SLICEWILD and D_PARTWILD.  If there is nothing after the
  ':' then it is considered to be a request to open the raw device or
  partition itself (not a file stored within it), and the fields are
  initialized to D_SLICENONE and D_PARTNONE.

  With this change in place, all the tests in src/tools/boot are succesful
  again, including the recently-added cases of booting from a zfs pool on
  a partition other than slice 1 of the device.

  PR:		236981

Changes:
  head/stand/common/disk.c
Comment 8 rz-rpi03 2019-04-26 12:07:16 UTC
I can confirm that with r346675, actual r346736, loader_lua is booting the system
from a encrypted ZFS pool again.
So I close this bug report.

Thank you.
Comment 9 commit-hook freebsd_committer freebsd_triage 2019-10-24 03:53:26 UTC
A commit references this bug:

Author: kevans
Date: Thu Oct 24 03:52:35 UTC 2019
New revision: 354006
URL: https://svnweb.freebsd.org/changeset/base/354006

Log:
  MFC r345477, r346675, r346984, r348748

  r345477:
  Distinguish between "no partition" and "choose best partition" with a
  constant.

  The values of the d_slice and d_partition fields of a disk_devdesc have a
  few values with special meanings in the disk_open() routine. Through various
  evolutions of the loader code over time, a d_partition value of -1 has
  meant both "use the first ufs partition found in the bsd label" and "don't
  open a bsd partition at all, open the raw slice."

  This defines a new special value of -2 to mean open the raw slice, and it
  gives symbolic names to all the special values used in d_slice and
  d_partition, and adjusts all existing uses of those fields to use the new
  constants.

  The phab review for this timed out without being accepted, but I'm still
  citing it below because there is useful commentary there.

  r346675:
  Restore the ability to open a raw disk or partition in loader(8).

  The disk_open() function searches for "the best partition" when slice and
  partition information is not provided as part of the device name.  As of
  r345477 the slice and partition fields of a disk_devdesc are initialized to
  D_SLICEWILD and D_PARTWILD; in the past they were initialized to -1, which
  was sometimes interpreted as meaning 'wildcard' and sometimes as 'open the
  raw partition' depending on the context.  So as an unintended side effect of
  r345477 it became basically impossible to ever open a disk or partition
  without doing the 'best partition' search.  One visible effect of that was
  the inability to open the raw disk to read the partition table correctly in
  zfs_probe_dev(), leading to failures to find the zfs pool unless it was on
  the first partition.

  Now instead of always initializing slice and partition to wildcards, the
  disk_parsedev() function initializes them based on the presence of a
  path/file name following the device.  If there is any path or filename
  following the ':' that ends the device name, then slice and partition are
  initialized to D_SLICEWILD and D_PARTWILD.  If there is nothing after the
  ':' then it is considered to be a request to open the raw device or
  partition itself (not a file stored within it), and the fields are
  initialized to D_SLICENONE and D_PARTNONE.

  With this change in place, all the tests in src/tools/boot are succesful
  again, including the recently-added cases of booting from a zfs pool on
  a partition other than slice 1 of the device.

  r346984:
  Use D_PARTISGPT rather than bare 255

  These three cases dovetail with other places in the code where we use
  or set D_PARTISGPT when we mean that the partitioning scheme is
  GPT. Use this #define to make the code easier to undertand.

  r348748:
  loader: disk_open() should honor D_PARTNONE

  The D_PARTNONE is documented to make it possible to open raw MBR
  partition, but the current disk_open() does not really implement this
  statement.

  The current code is checking partition against -1 (D_PARTNONE) but does
  attempt to open partition table in case we do have FreeBSD MBR partition
  type.
  Instead, we should check -2 (D_PARTWILD).

  In case we do have MBR + BSD label, this code is only working because
  by default, the first BSD partiton is created starting with relative sector
  0, and we can still access the BSD table from that MBR slice.

  PR:		236981

Changes:
_U  stable/12/
  stable/12/stand/common/disk.c
  stable/12/stand/common/disk.h
  stable/12/stand/efi/libefi/efipart.c
  stable/12/stand/efi/loader/main.c
  stable/12/stand/i386/libi386/biosdisk.c
  stable/12/stand/libsa/zfs/zfs.c
  stable/12/stand/mips/beri/loader/beri_disk_cfi.c
  stable/12/stand/mips/beri/loader/beri_disk_sdcard.c
  stable/12/stand/uboot/common/main.c
  stable/12/stand/uboot/lib/disk.c
  stable/12/stand/usb/storage/umass_loader.c
  stable/12/stand/userboot/userboot/main.c
  stable/12/stand/userboot/userboot/userboot_disk.c