Bug 241118

Summary: [boot] 12.1-BETA3 installer hangs before loader menu
Product: Base System
Reporter: Ryan Moeller <ryan>
Component: misc
Assignee: freebsd-bugs (Nobody) <bugs>
Status: Closed FIXED
Severity: Affects Only Me
CC: cr+freebsd, emaste, sigsys, tsoome
Priority: ---    
Version: 12.1-RELEASE   
Hardware: amd64   
OS: Any   
Bug Depends on:    
Bug Blocks: 240700    

Description Ryan Moeller 2019-10-07 18:57:16 UTC
FreeBSD 12.0-RELEASE disc1.iso written to a USB stick is able to boot on a FreeNAS Mini XL+, but the 12.1-BETA3 amd64 disc1.iso hangs in the spinner after printing "Consoles: EFI console" and before the loader menu is displayed. It does this on any USB port and with extra features such as EFI network stack disabled in firmware.
Doing a legacy boot, the screen is cleared and nothing gets displayed, not even the spinner.
I tried the snapshot 12.1-PRERELEASE 20190906 r351916 disc1.iso and it fails as well, so the break must have been introduced before that.
Comment 1 Ed Maste freebsd_committer freebsd_triage 2019-10-08 14:42:47 UTC
Are you able to try a few other snapshots to narrow down the breakage?
Comment 2 Ryan Moeller 2019-10-08 14:57:39 UTC
(In reply to Ed Maste from comment #1)
That was the earliest snapshot I could find on download.freebsd.org. Is there some place I can find older snapshots?
Comment 3 Ed Maste freebsd_committer freebsd_triage 2019-10-08 15:01:01 UTC
(In reply to Ryan Moeller from comment #2)
There are also CI snapshots available at
https://artifact.ci.freebsd.org/snapshot/stable-12/
Comment 4 Ryan Moeller 2019-10-08 15:26:47 UTC
(In reply to Ed Maste from comment #3)
Thanks, but I don't see install CD images there. I found FreeBSD-12.0-STABLE-amd64-20190411-r346111-mini-memstick.img in my collection so I'll give that a try.
Comment 5 Ryan Moeller 2019-10-08 15:30:31 UTC
(In reply to Ryan Moeller from comment #4)
I also found FreeBSD-12.0-STABLE-amd64-20190425-r346638-disc1.iso on another server, will try that as well.
Comment 6 Ryan Moeller 2019-10-08 15:37:46 UTC
(In reply to Ryan Moeller from comment #5)
r346638 works.
Comment 7 Ed Maste freebsd_committer freebsd_triage 2019-10-08 15:56:17 UTC
(In reply to Ryan Moeller from comment #4)
I believe you should be able to uncompress one of the disk.img.xz images and write to a USB stick to test booting (assuming that it's a general problem in the loader, and not something specific to the install images).
Comment 8 Ryan Moeller 2019-10-08 16:54:18 UTC
(In reply to Ed Maste from comment #7)
Ok, I had to dig through a few folders to find one of those.

r347048
Blank loader screen, but after a little wait the kernel boots with a very low contrast font color. With the other images I have tested, the system would simply reset after sitting for a while.

r348988
Same; blank screen for a while then dim font for kernel.

r350956
All black, no visible activity. System doesn't reset (or I didn't wait long enough), but that might be a difference between the UEFI booted CD image and the legacy booted CI disk images.

I'll work on narrowing it down between r348988 and r350956.
Comment 9 Ryan Moeller 2019-10-09 15:19:41 UTC
r348508 - Blank screen for a while then dim font for kernel.
r348524 - All black, no visible activity.
Comment 10 Ryan Moeller 2019-10-09 16:17:43 UTC
None of the commits to stable/12 between r348508 and r348524 are anywhere near the loader, and the behavior flip-flops again later, so I think both behaviors must be the same problem. I'll have to look even further back.
Comment 11 Ryan Moeller 2019-10-09 19:07:19 UTC
r346844 - No loader, kernel visible
r346774 - No loader, kernel visible

On a hunch I tried the r346638 image from CI and that doesn't boot, it's just a blank screen, even though the disc1.iso for that revision did work. The r340154 image from CI is about as far back as history goes there, and it falls in the "doesn't show anything until the kernel" bucket.

I tried the CI image for r346638 again on a different USB drive and this time it still didn't show the loader, but the kernel did boot with a faint font. Same issue again I guess.

I tested the r346638 disc1.iso I had again. It turns out that UEFI booting works correctly, but when legacy booted it exhibits the same problems as all the CI images (they're all legacy boot only). That's a relief; for a minute I was worried the CI images had the console set to comconsole or something. To be sure, I confirmed there is no loader.conf or boot.config forcing weird settings on the CI images.

So vidconsole at least has been broken on stable/12 for quite a while it seems, but that must not be related to the UEFI boot issue.

Fun fun fun. I can't really narrow it down any further without starting to build my own images. I do need to do other things with this machine for the rest of the day, so that will have to wait.
Comment 12 Ed Maste freebsd_committer freebsd_triage 2019-10-09 19:09:09 UTC
(In reply to Ryan Moeller from comment #11)
Thank you for all of the effort so far in trying to track this down. Unfortunately I have some travel coming up and won't be able to look at it in detail but hopefully either you or someone else will be able to chase details down before the release.
Comment 13 Ryan Moeller 2019-10-10 19:46:36 UTC
I found the issue I was having with vidconsole. It was an incorrect setting in the BIOS: [Advanced > PCIe/PCI/PnP Configuration > Onboard Video OPROM] was set to EFI instead of legacy. With it set to legacy I can see the boot text now. It doesn't fix the original issue though.

For sanity checking (legacy booting images from CI),

r353390.img (latest HEAD) - text visible

Consoles: internal video/keyboard
BIOS drive C: is disk0
BIOS drive D: is disk1ersion is 1.02 (looks like we're missing a screen clear)
BIOS drive E: is disk2
...
BIOS drive M: is disk10
BIOS drive N: is disk11
|

The spinner twiddled for a while then the system hung here. No loader menu.

r353385.img (latest stable/12) - text visible but stalls before menu (same as above, but without the glitch on the drive D: line)


r340154.img (earliest stable/12) - works correctly
r348988.img - works correctly
r351206.img - works correctly
r352298.img - stalls before loader menu

Now this feels like some progress. I'll narrow it down between r351206 and r352298 next.
Comment 14 Ryan Moeller 2019-10-10 21:33:01 UTC
r351752 - stalls
r351504 - stalls
r351358 - works

r351426 - stalls
r351390 - stalls
r351384 - stalls

There are no amd64 images in the CI between r351358 and r351384.

r351384 is a commit to stand/ so I'll try reverting that and building an image to test tomorrow.
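
The narrowing above amounts to a bisection over the available CI snapshots. A minimal sketch in C, assuming a single works-to-stalls transition across the sorted revisions: `first_bad_revision`, `boot_test_fn`, and `replay_reported_results` are hypothetical names, and the replay function only encodes the results reported in comments 13 and 14 rather than booting anything.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of booting one snapshot image. */
enum boot_result { BOOT_WORKS, BOOT_STALLS };

/* Hypothetical callback: boot the image for `rev` and report the outcome. */
typedef enum boot_result (*boot_test_fn)(unsigned rev);

/*
 * revs[] is sorted ascending; revs[0] is assumed to work and revs[n - 1]
 * is assumed to stall. Returns the index of the first stalling revision,
 * assuming there is exactly one works-to-stalls transition.
 */
size_t
first_bad_revision(const unsigned *revs, size_t n, boot_test_fn test)
{
	size_t good = 0, bad = n - 1;

	assert(n >= 2);
	while (bad - good > 1) {
		size_t mid = good + (bad - good) / 2;

		if (test(revs[mid]) == BOOT_STALLS)
			bad = mid;	/* breakage is at or before mid */
		else
			good = mid;	/* breakage is after mid */
	}
	return (bad);
}

/*
 * Stand-in for a real boot test: replays the outcomes reported in this
 * bug (r351358 works, r351384 and later stall).
 */
enum boot_result
replay_reported_results(unsigned rev)
{
	return (rev >= 351384 ? BOOT_STALLS : BOOT_WORKS);
}
```

Called with the revisions tested in this bug, sorted ascending, and `replay_reported_results` as the callback, the search lands on r351384. The earlier legacy-boot results, which flip-flopped because of the video OPROM setting, would violate the single-transition assumption.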
Comment 15 Ryan Moeller 2019-10-16 00:28:28 UTC
I built release.iso on releng/12.1 and confirmed it stalls before the loader menu. Then I reverted r351384 and did another build. The second installer boots successfully.
Comment 16 Toomas Soome freebsd_committer freebsd_triage 2019-10-16 06:05:27 UTC
(In reply to Ryan Moeller from comment #15)

Have you attempted to boot from current? I wonder if this is something we have fixed already but not merged to 12...

In any case, I'll look it over; it will take a bit of time.
Comment 17 Ryan Moeller 2019-10-16 13:14:45 UTC
(In reply to Toomas Soome from comment #16)
It is probably broken on HEAD too. The boot fails in the same way. I'll build a test image with the appropriate commit reverted today to confirm it is the same issue.
Comment 18 Ryan Moeller 2019-10-18 13:43:46 UTC
I built an iso from HEAD at r353681 and confirmed the boot stalls, then reverted r350825 and r350772 and built a new iso, which successfully boots.
Comment 19 Toomas Soome freebsd_committer freebsd_triage 2019-10-18 13:55:12 UTC
(In reply to Ryan Moeller from comment #18)

I have been trying to replicate the issue but have failed so far. But I have now reviewed the messages here, and I guess my test setup is just not replicating what you have.

Could you post, or mail me directly, the output from zdb run without arguments?
Comment 20 Ryan Moeller 2019-11-02 22:50:08 UTC
I've run into this problem again trying to boot a 12.1-RC2 installer on a server currently running vanilla FreeBSD 12.0-RELEASE. This one has mirrored SSDs for boot and a pool of 24 disks grouped into mirrors (with two reserved for hot spares). :(

(I have been corresponding with Toomas by email but I wanted to document this publicly as well.)
Comment 21 Ryan Moeller 2019-11-07 23:54:22 UTC
Tested latest head snapshot FreeBSD-13.0-CURRENT-amd64-20191107-r354423-disc1.iso on the FreeBSD 12.0 machine.
Still hangs when the pool disks are installed. I see "Consoles: efi" and the spinner spins for a while then gets stuck. If I slide out the storage pool disks the image boots.
Comment 22 Chris R 2019-11-18 13:58:20 UTC
This is happening for me too. I have a SuperMicro X11SDV-4C-TP8F (Xeon-D) server which has two ZFS pools: a system pool consisting of two mirrored SATA SSDs and a data pool consisting of 12 SATA HDDs. The system was running/booting 12.0-RELEASE-p11 with no problems. I started upgrading to 12.1-RELEASE and the machine failed in the same way as the reports in the other comments here, at the "Consoles: EFI console" line. If I remove all the disks for the data pool, the machine boots fine (and I've since finished the 12.1 upgrade). However, now that the machine is on 12.1, it refuses to boot if the data pool disks are inserted, so I have to boot the machine with them removed, then manually insert the disks once it's booted.
Comment 23 Chris R 2019-11-19 14:40:44 UTC
Just commenting to confirm that the work-around I'm currently using is to copy the /boot/loader and /boot/loader.efi from 12.0-RELEASE into my /boot. This is definitely a regression in the loader since 12.0-RELEASE.
Comment 24 Ryan Moeller 2019-12-02 22:27:32 UTC
I found zfs_spa_init() in zfsimpl.c is stuck in an infinite loop iterating through a circular list of vdevs. It's not yet clear where the cycle comes from, but it's good to finally have a clue.
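
A loop that walks a list until it reaches NULL never terminates if the list is circular. A minimal sketch of detecting such a cycle with Floyd's tortoise-and-hare technique follows. `struct node` and `list_has_cycle` are hypothetical stand-ins for illustration only; they are not the loader's vdev structures or the code in zfsimpl.c, and this is not the fix that was later committed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical singly linked list entry; not the loader's vdev type. */
struct node {
	struct node *next;
};

/*
 * Returns true if following next pointers from head never reaches NULL.
 * The slow pointer advances one step per iteration and the fast pointer
 * two; they can only meet again if the list loops back on itself.
 */
bool
list_has_cycle(const struct node *head)
{
	const struct node *slow = head, *fast = head;

	while (fast != NULL && fast->next != NULL) {
		slow = slow->next;
		fast = fast->next->next;
		if (slow == fast)
			return (true);
	}
	return (false);	/* fast ran off the end: the list terminates */
}
```

A check like this is useful as a debugging aid when an iteration appears stuck: it uses constant memory and does not modify the list.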
Comment 25 Toomas Soome freebsd_committer freebsd_triage 2019-12-19 19:26:05 UTC
Fixed in current r355786, waiting for MFC.
Comment 26 commit-hook freebsd_committer freebsd_triage 2019-12-22 08:22:49 UTC
A commit references this bug:

Author: tsoome
Date: Sun Dec 22 08:22:03 UTC 2019
New revision: 356003
URL: https://svnweb.freebsd.org/changeset/base/356003

Log:
  MFC r354283, r354323, r354363, r354364, r354593, r355773, r355786:

  loader: we do not support booting from pool with log device
  loader: factor out label and uberblock load from vdev_probe, add MMP checks
  loader: populate nvl with data even when label_txg is 0
  loader: clean up the noise around log device
  loader: memory leak in vdev_label_read_config()
  loader: zfsimpl.c cstyle cleanup
  loader: rewrite zfs vdev initialization

  In some cases the pool discovery will get stuck in infinite loop while setting
  up the vdev children.

  To fix, we split the vdev setup into two parts, first we create vdevs based on
  configuration we do get from pool label, then, we process pool config from MOS
  and update the pool config if needed.

  This patch bundle is work leading to and including fix for issue when
  in some cases the pool configuration build does end up in infinite loop.

  PR:		241118
  Reported by:	Ryan Moeller

Changes:
_U  stable/12/
  stable/12/stand/libsa/zfs/zfsimpl.c
  stable/12/sys/cddl/boot/zfs/zfsimpl.h
  stable/12/sys/cddl/boot/zfs/zfssubr.c