Bug 229972 - 11.2-RELEASE kernel wont boot with zfs mirror root
Summary: 11.2-RELEASE kernel wont boot with zfs mirror root
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2018-07-22 21:30 UTC by Patrick Mackinlay
Modified: 2021-05-14 19:23 UTC (History)
CC List: 4 users

See Also:


Attachments
Verbose boot log (includes "?" output from the mountroot prompt) (97.57 KB, text/plain)
2018-07-23 04:57 UTC, Eugene M. Kim
Description Patrick Mackinlay 2018-07-22 21:30:22 UTC
I tried to upgrade from FreeBSD 11.1 to 11.2. With the 11.2 kernel, the boot failed and the machine rebooted automatically. I can't see the error message because it reboots too quickly, but the failure seems to happen when the kernel tries to mount the root file system. My root file system is a ZFS mirror. Nothing is written to my log files, so the root file system is evidently never mounted.
Comment 1 Eugene M. Kim 2018-07-23 04:34:43 UTC
I noticed the same error yesterday.  I am currently trying to get a full kernel log over a serial console.  Will update shortly.
Comment 2 Eugene M. Kim 2018-07-23 04:57:16 UTC
Created attachment 195385
Verbose boot log (includes "?" output from the mountroot prompt)

Uploaded a boot -v log.  The important part is at the end:

--- BEGIN dragon ---
Trying to mount root from zfs:hydrogen []...
GEOM: new disk ada1
GEOM: new disk ada2
GEOM: new disk ada3
GEOM: new disk ada4
GEOM: new disk ada5
random: unblocking device.
Mounting from zfs:hydrogen failed with error 2; retrying for 3 more seconds
Mounting from zfs:hydrogen failed with error 2; retrying for 2 more seconds
Mounting from zfs:hydrogen failed with error 2; retrying for 1 more second
Mounting from zfs:hydrogen failed with error 2.

Loader variables:
  vfs.root.mountfrom=zfs:hydrogen

Manual root filesystem specification:
  <fstype>:<device> [options]
      Mount <device> using filesystem <fstype>
      and with the specified (optional) option list.

    eg. ufs:/dev/da0s1a
        zfs:tank
        cd9660:/dev/cd0 ro
          (which is equivalent to: mount -t cd9660 -o ro /dev/cd0 /)

  ?               List valid disk boot devices
  .               Yield 1 second (for background tasks)
  <empty line>    Abort manual input

mountroot> ?

List of GEOM managed disk devices:
  diskid/DISK-Z500CAZ4p2 diskid/DISK-Z500CAZ4p1 gptid/9c8516d4-59b9-11e4-88bf-74d02b1366fc gpt/hydrogen-6-root gptid/9c689369-59b9-11e4-88bf-74d02b1366fc gpt/hydrogen-6-boot diskid/DISK-Z500CAKLp2 diskid/DISK-Z500CAKLp1 gptid/9b9a22dc-59b9-11e4-88bf-74d02b1366fc gpt/hydrogen-5-root gptid/9b7d9f31-59b9-11e4-88bf-74d02b1366fc gpt/hydrogen-5-boot diskid/DISK-Z500C9H1p2 diskid/DISK-Z500C9H1p1 gptid/98c41a39-59b9-11e4-88bf-74d02b1366fc gpt/hydrogen-2-root gptid/98a79a2f-59b9-11e4-88bf-74d02b1366fc gpt/hydrogen-2-boot diskid/DISK-Z500CB0Ap2 diskid/DISK-Z500CB0Ap1 gptid/97bc7621-59b9-11e4-88bf-74d02b1366fc gpt/hydrogen-1-root gptid/97763b2e-59b9-11e4-88bf-74d02b1366fc gpt/hydrogen-1-boot diskid/DISK-Z500CAZ4 ada5p2 ada5p1 diskid/DISK-Z500CAKL ada4p2 ada4p1 diskid/DISK-Z500C9H1 ada1p2 ada1p1 diskid/DISK-Z500CB0A ada0p2 ada0p1 diskid/DISK-Z304Z8ZGp2 diskid/DISK-Z304Z8ZGp1 gptid/962dbc0a-08a9-11e6-ae59-74d02b1366fc gpt/hydrogen-4-root gptid/700a5eeb-08a9-11e6-ae59-74d02b1366fc gpt/hydrogen-4-boot diskid/DISK-Z30508VNp2 diskid/DISK-Z30508VNp1 gptid/691970f2-0853-11e6-ae59-74d02b1366fc gpt/hydrogen-3-root gptid/3c36c0de-0853-11e6-ae59-74d02b1366fc gpt/hydrogen-3-boot diskid/DISK-Z304Z8ZG ada3p2 ada3p1 diskid/DISK-Z30508VN ada2p2 ada2p1 ada5 ada4 ada3 ada2 ada1 ada0

mountroot> 

--- END dragon ---
Comment 3 Eugene M. Kim 2018-07-23 05:11:13 UTC
A few notes:

* This kernel is a vanilla VIMAGE kernel (GENERIC + options VIMAGE).
* Loaded modules are shown in the boot log.
* The root pool (hydrogen) is a raidz pool with 6 GPT partitions (ada[0-5]p2).
* The root pool accesses the partitions using the GPT labels (/dev/gpt/hydrogen-[1-6]-root).
* The old kernel (vanilla VIMAGE kernel from 11.1-RELEASE-p10) can still boot from the pool.
Comment 4 Patrick Mackinlay 2018-08-02 06:56:15 UTC
Yesterday I had some time to investigate this further, and I believe I have found the problem (at least for me).

I created a bhyve VM and installed a plain FreeBSD 11.1 instance with a single root ZFS pool (nothing special: single partition, no raidz or mirror). I then used freebsd-update to bring it up to the latest 11.1 patch level; this booted fine. After that I used freebsd-update to go to 11.2. No problems.

My main desktop (the one that failed the upgrade) has two ZFS pools: a mirror for the base OS and a raidz2 pool (on geli partitions) for my data. I copied the two disks I use for my image partition onto two old spare disks. I copied the ZFS partitions using zfs send/receive. I created the boot partitions from scratch, using the boot code (and partcode) from my 11.2 VM install. This is when I noticed that the gptzfsboot code from 11.2 differs from the 11.1 gptzfsboot code. After a few changes to the VM copies (rc.conf had to be modified for the different network, vfs.root.mountfrom in loader.conf had to be changed, ...), I booted the copy in my VM. I followed the freebsd-update process, but note that my install has a custom kernel, so after the final "freebsd-update install" I used the old 11.1 kernel. I then built my kernel from source and rebooted the VM. All went well, no issues.

So there are two things I did differently between the real upgrade and the VM upgrade:
1. I used the latest gptzfsboot code in the VM upgrade.
2. I built the custom kernel after the 11.2 base upgrade in the VM. For the non-VM machine I built the new kernel before the base upgrade and then installed it after the base upgrade.

One of those two differences fixed the problem. I assume it was the latest gptzfsboot code (I always build the new kernel on the old code base with the new src, and I have never had problems in the past).

So as far as I am concerned this issue is fixed, although it would be nicer if FreeBSD were a bit more forgiving when you get it wrong. Also, I did not see any note in the UPDATING file about the gptzfsboot code changing.
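
For readers who suspect the same cause: on a BIOS/GPT system, the boot code mentioned above is normally refreshed with gpart(8). The following is a minimal sketch, not a confirmed fix for this bug. The disk names (ada0, ada1) and the freebsd-boot partition index (1) are assumptions for illustration and do not come from the report; check them with "gpart show" first. The helper names (run, update_bootcode) are hypothetical. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: rewrite the protective MBR and gptzfsboot on each mirror member.
# Assumed (not from the report): disks ada0/ada1, freebsd-boot at index 1.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to run them.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# update_bootcode INDEX DISK...
update_bootcode() {
    index="$1"
    shift
    for disk in "$@"; do
        # -b writes the protective MBR, -p writes the partition boot code
        # into the partition selected by -i.
        run gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i "$index" "$disk" || return 1
    done
}

update_bootcode 1 ada0 ada1
```

Run it once as-is to review the printed commands, then with DRY_RUN=0 as root after the new /boot/gptzfsboot has been installed by the upgrade.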
Comment 5 harrison 2018-09-20 03:00:35 UTC
I have a fix that has worked on 4 systems after upgrading to 11.2.

I believe it's a race condition at boot with the ZFS datasets. The system tries to mount them in no specific order, which is a problem because the ROOT dataset needs to be mounted first; only then can the rest be mounted. I fixed this by booting from a USB drive, mounting the zroot/default/ROOT dataset first, and then running zfs mount -a. After the reboot the system came back without issues.
Steps:
1. boot off USB 11.2 disk
2. zpool import -R /mnt <zroot>
3. zfs mount <zroot>/default/ROOT
4. zfs mount -a
5. reboot back into the upgraded OS.
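
The steps above can be sketched as a script run from the live USB environment. This only illustrates the workaround as described, not a confirmed fix. The pool name (zroot) and the root dataset name are placeholders: the comment writes the dataset as <zroot>/default/ROOT, while the stock installer layout names it zroot/ROOT/default, so confirm the real name with "zfs list" first. The helper names (run, recover_root) are hypothetical. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the workaround steps: import the pool under /mnt, mount the
# root dataset first, then mount everything else, then reboot.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to run them.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# recover_root POOL ROOT_DATASET
recover_root() {
    pool="$1"
    rootfs="$2"
    run zpool import -R /mnt "$pool" &&  # step 2: import with /mnt as altroot
    run zfs mount "$rootfs" &&           # step 3: mount the root dataset first
    run zfs mount -a &&                  # step 4: mount the remaining datasets
    run reboot                           # step 5: back into the upgraded OS
}

# Placeholder names; substitute the real pool and root dataset.
recover_root zroot zroot/ROOT/default
```

The steps are chained with && so that a failed import or mount stops the sequence before the reboot.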
Comment 7 bro.development 2018-10-17 12:23:11 UTC
Is this the bug listed on https://www.freebsd.org/releases/11.2R/errata.html ?

"""
[2017-07-25] A late issue was discovered with FreeBSD/arm64 and "root on ZFS" installations where the root ZFS pool would fail to be located.

There currently is no workaround.
"""

Any hints on how to debug this issue?