Bug 238257 - zfsloader: 11.2-STABLE r345498 to r347183 update leaves unbootable system
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
Keywords: needs-qa, regression
Reported: 2019-05-31 08:13 UTC by Scott Bennett
Modified: 2019-10-20 12:49 UTC
Description Scott Bennett 2019-05-31 08:13:05 UTC
After installing the new r347183 kernel and rebooting, the new kernel appeared to be working, so I proceeded with the usual sequence: mergemaster -p -F, make installworld, mergemaster -F, reboot.  After entering the GELI passphrase for the boot pool primary device, I got a message beginning with "BTX" followed by several lines of hexadecimal with spaces interspersed.  Another subscriber to the freebsd-stable list suggested waiting one second after entering the GELI passphrase and then hitting space.  Doing so produced a prompt that showed the path to the broken zfsloader and allowed entry of a different path.  I entered the same path with ".old" appended, and that got me a boot menu.  Once the system was running, I renamed the broken zfsloader to zfsloader.bad.r347183, renamed zfsloader.old to zfsloader, and added a hard link to it called zfsloader.good.r345498.  I then reactivated the r345498 boot environment and eventually, after further exploration, rebooted, so I am currently back to running r345498.
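
The manual recovery described above (set the broken loader aside, restore zfsloader.old, keep a revision-tagged hard link) could be scripted roughly as follows. This is a minimal sketch and not part of the original report: the function name restore_previous_loader and its parameters are hypothetical, and it assumes the loader files sit together in one directory (such as /boot) and that the caller has permission to rename files there.

```python
import os
from pathlib import Path


def restore_previous_loader(boot_dir, bad_rev, good_rev, name="zfsloader"):
    """Swap a broken loader for the saved ".old" copy.

    Hypothetical helper mirroring the manual steps in the report:
      1. rename <name>      -> <name>.bad.<bad_rev>
      2. rename <name>.old  -> <name>
      3. hard-link <name>   -> <name>.good.<good_rev>
    """
    boot = Path(boot_dir)
    current = boot / name
    previous = boot / f"{name}.old"
    bad = boot / f"{name}.bad.{bad_rev}"
    good = boot / f"{name}.good.{good_rev}"

    # Both the broken loader and the saved copy must exist before we touch anything.
    for required in (current, previous):
        if not required.is_file():
            raise FileNotFoundError(f"missing {required}")

    # Refuse to clobber files left over from an earlier recovery.
    for target in (bad, good):
        if target.exists():
            raise FileExistsError(f"{target} already exists")

    current.rename(bad)       # keep the broken binary for later inspection
    previous.rename(current)  # put the known-good loader back in place
    os.link(current, good)    # second name for the same inode, tagged by revision
    return current, bad, good
```

For the revisions in this report, the call would look like restore_previous_loader("/boot", "r347183", "r345498"). Running the same step after every installworld is the kind of local workaround the next paragraph describes.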

Since reverting, I have continued to update my source tree, but have not bothered to run a "make buildworld" because I have seen no further updates to either loader or zfsloader from r347183 through r348441.  IOW, I cannot update my FreeBSD system any further until this bug is fixed unless I want to implement a local addition to the updating procedure to add steps to reinstall a working-but-outdated copy of zfsloader after the "make installworld" step and remember to do that without fail for every update.

For the unsuspecting, but affected, FreeBSD user, who may not know how to get the second-stage boot code to ask for a new path to a working boot loader (as I did not), it would be unconscionable to release 11.3 before zfsloader is fixed.  I do not know whether loader is similarly broken.

I consider such a bug to be of a severity just less than critical because there is a way to get the system booted *provided* one knows the trick.  I have not seen this trick documented anywhere, and I remain grateful to crahman Ta gmail for responding to my plea for help on the -stable list with the instructions for that trick.  IMO, the trick should be included and *highlighted* in the Handbook's instructions for updating from source.
Comment 1 Kubilay Kocak freebsd_committer freebsd_triage 2019-05-31 08:55:27 UTC
Thank you for the report, Scott.

CC committer(s) of base r344399 ("MFC GELI Loader Improvements"), base r346475, base r346549 and base r346483, which may be implicated (relating to loader/geli).

@Scott Can you provide a boot log of the failure (or screenshot thereof) as an attachment?
Comment 2 Kubilay Kocak freebsd_committer freebsd_triage 2019-05-31 09:01:21 UTC
See Also base r336537

20180720:
	zfsloader's functionality has now been folded into loader.
	zfsloader is no longer necessary once you've updated your
	boot blocks. For a transition period, there will be a symlink
	in place from zfsloader to loader to allow a smooth transition
	until the boot blocks can be updated
Comment 3 Scott Bennett 2019-05-31 12:10:45 UTC
From its description, r346549 doesn't appear to have anything to do with this bug.
Yes, I am going to have to reboot soon anyway due to the kernel's memory mismanagement bug(s) (as of r345498, at least) that let it violate vm.max_wired and, it appears, vm.kmem_size_max until the machine becomes difficult to use.  Mine has now been up (r345498) for a bit over nine days, and it is being a pain again.  Logs and screenshots?  In the boot process?  This is running on bare hardware, not a VM.  I will have to write what I see down on paper, and will enter it into a comment here once the system is back up and running on r345498 again.  Give me a couple of days to get to it due to outside issues disrupting my sleep habits and the rest of my time.
Comment 4 Scott Bennett 2019-05-31 12:19:02 UTC
I can't easily confirm this at this point, but my recollection is that this UPDATING entry was not present as of r347183.  In any case, I do remember looking carefully at /boot/loader and /boot/zfsloader* on r347183 after crahman's instructions allowed me to complete a boot into that revision.  loader and zfsloader were distinct, executable binaries, each with its own, distinct inode number.  If that is not what I should have seen at that revision, then installworld didn't take care of it.
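
The check described in this comment (whether loader and zfsloader are one file, a symlink, or two distinct binaries) can be expressed as a short script. This is a sketch under stated assumptions: the function name loader_relationship and its return strings are hypothetical, and the /boot paths in the usage note are taken from the comment rather than verified against any particular revision.

```python
import os


def loader_relationship(loader_path, zfsloader_path):
    """Classify how zfsloader relates to loader.

    Returns one of (hypothetical labels):
      "symlink"  - zfsloader is a symbolic link
      "hardlink" - both names refer to the same inode on the same device
      "distinct" - two separate files, each with its own inode
    """
    # Check for a symlink first, because os.stat() would follow it.
    if os.path.islink(zfsloader_path):
        return "symlink"

    a = os.stat(loader_path)
    b = os.stat(zfsloader_path)
    # Inode numbers are only unique per device, so compare both fields.
    if (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino):
        return "hardlink"
    return "distinct"
```

Called as loader_relationship("/boot/loader", "/boot/zfsloader"), a result of "distinct" would match what is reported here for r347183, while "symlink" is what the UPDATING entry quoted in comment 2 describes for the transition period.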
Comment 5 Kyle Evans freebsd_committer freebsd_triage 2019-05-31 12:25:00 UTC
(In reply to Scott Bennett from comment #4)

Hi,

Can you snag loader_4th from a recent -CURRENT snapshot and try that as your /boot/loader with other boot block bits updated please?
Comment 6 Scott Bennett 2019-05-31 12:53:59 UTC
Kyle, I will try, but no promises.  I really do not want this system down for very long when I have to reboot it, and it was already down about a week over this issue once.  It may have to wait until the kernel has eaten all the available page frames again (i.e., the next necessary reboot) after I get the error output for Kubilay.
Comment 7 Kyle Evans freebsd_committer freebsd_triage 2019-05-31 13:16:00 UTC
(In reply to Scott Bennett from comment #6)

I'm afraid the output you get from loader is likely nonsensical (based on your description in comment 0, and because it happens way too early in loader). IMO, it will be a more effective use of your downtime to try the new loader instead, so we can figure out whether there's something special that hasn't been fixed on head/ or whether I simply missed a critical bit from one of the larger MFCs in the noted timespan.
Comment 8 Scott Bennett 2019-10-20 12:49:52 UTC
This bug has been fixed for a long time now.  Unfortunately, the graphics stack broke for my system at about the same time as the fix, leaving me with no functional graphics or graphical web browser.  However, those are now working again.  In any case, this bug report needs to be closed.