Bug 193758 - either gptzfsboot or zfsloader hangs during boot after kernel and pool upgrade
Status: Closed Feedback Timeout
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 8.4-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
Keywords: loader, needs-qa
Reported: 2014-09-18 20:41 UTC by mark
Modified: 2023-09-11 07:14 UTC

Description mark 2014-09-18 20:41:11 UTC
On June 29, I updated an amd64 system with a GPT ZFS root, to FreeBSD 8.4-RELEASE-p13 #0 r268016: Sun Jun 29 12:58:11 UTC 2014

I rebooted and this worked without issue. On Sep 9, 2014, I upgraded both pools to the latest version supported by 8.4-RELEASE, so the version property no longer applies:

$ zpool get version 
NAME   PROPERTY  VALUE    SOURCE
pool0  version   -        default
pool1  version   -        default

$ zpool status
  pool: pool0
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	pool0       ONLINE       0     0     0
	  da0p3     ONLINE       0     0     0

errors: No known data errors

  pool: pool1
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	pool1       ONLINE       0     0     0
	  da1p1     ONLINE       0     0     0

errors: No known data errors

$ gpart show
=>       34  312477629  da0  GPT  (149G)
         34        128    1  freebsd-boot  (64k)
        162   67108864    2  freebsd-swap  (32G)
   67109026  245368637    3  freebsd-zfs  (117G)

=>       34  312477629  da1  GPT  (149G)
         34  312477629    1  freebsd-zfs  (149G)


I applied the recommended update to gptzfsboot and the pmbr as follows:

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

On Sep. 16, I rebooted the system and found that it no longer booted: it froze right after printing the first "/" symbol, i.e. the "spinner" did not spin. Even CTRL-ALT-DEL could not break it out of the frozen state; a reset or power cycle was required.

We used a bootable 8.4 USB image to get access to the system after it became clear there was no way to boot from the internal drives. Using the fixit shell, we determined the pools were intact and undamaged, and reapplied the bootcode from the USB image as a speculative measure, to no effect.

The underlying block devices are 3ware RAID controller volumes as follows:

$ egrep da[01] /var/run/dmesg.boot
da0 at twa0 bus 0 scbus0 target 0 lun 0
da0: <AMCC 9650SE-4LP DISK 3.08> Fixed Direct Access SCSI-5 device 
da0: 100.000MB/s transfers
da0: 152577MB (312477696 512 byte sectors: 255H 63S/T 19450C)
da1 at twa0 bus 0 scbus0 target 1 lun 0
da1: <AMCC 9650SE-4LP DISK 3.08> Fixed Direct Access SCSI-5 device 
da1: 100.000MB/s transfers
da1: 152577MB (312477696 512 byte sectors: 255H 63S/T 19450C)

As a workaround, we've configured a USB stick to handle the boot phase, but this seems like quite an extreme failure mode for gptzfsboot that could probably do with some attention.

I've not yet attempted to replicate this on any other system. Here are the CPU and RAM details for this one:

CPU: Intel(R) Xeon(R) CPU           E5410  @ 2.33GHz (2327.51-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x10676  Family = 6  Model = 17  Stepping = 6
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0xce3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,DCA,SSE4.1>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant
real memory  = 17716740096 (16896 MB)
avail memory = 16535756800 (15769 MB)

It's a TYAN Tempest i5100X S5375 motherboard with the following dmidecode details for the BIOS:

        Vendor: American Megatrends Inc.
        Version: 080014 
        Release Date: 01/30/2008

So, where is the boot process likely to have hung if it was so early and what does it mean if CTRL-ALT-DEL is ineffective?
Comment 1 mark 2014-09-18 21:17:37 UTC
This forum posting sounds very close to what I found.

https://forums.freebsd.org/viewtopic.php?&t=42419

So I'll also point out that I have the following in /boot.config:

-hD 115200

Obviously the serial console worked fine with the prior version of gptzfsboot (from 8.2).
Comment 2 mark 2014-09-25 20:46:24 UTC
Someone e-mailed me to say they had the same issue and that using the gptzfsboot code from the 9.2 release resolved it. I've not yet had a chance to try that, but as far as I can tell the only differences in zfsboot between 8.4 and 9.2 are a couple of lines relating to serial console handling, which suggests, like the forum comment, that gptzfsboot is hanging in some serial console related code.

So, if this is the root of the issue, I would guess that some serial console code for boot loaders changed between 8.2 and 8.4, but that zfsboot, at least, wasn't updated to match until 9.2 at the latest.
Comment 3 mark 2014-09-25 21:00:22 UTC
http://freebsd.1045724.n5.nabble.com/Upgrade-of-RELENG-8-ZFS-boot-pool-leads-to-unbootable-system-td5774091.html  also talks about the /boot.config line interfering with booting.
Comment 4 mark 2014-10-29 21:02:27 UTC
Following a suggestion from Matt Reimer, I've updated the bootcode

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

but using a gptzfsboot from FreeBSD 9.2-RELEASE.

So my immediate problem is resolved, but it does mean there's a bug in the gptzfsboot for FreeBSD 8.4 at least.
Comment 5 mark 2014-10-29 21:07:10 UTC
So there's a bug here still.
Comment 6 Andrey V. Elsukov freebsd_committer freebsd_triage 2014-10-29 21:24:40 UTC
(In reply to mark from comment #2)
> chance to try that, but as far as I can tell the only difference in zfsboot
> between 8.4 and 9.2 were a couple of lines relating to serial console
> handling, suggesting like the forum comment that gptzfsboot is hanging in
> some serial console related code.

The difference between 8.x and 9.x+ is very big.
9.x includes this commit and many other fixes:
https://svnweb.freebsd.org/base?view=revision&revision=243243
Comment 7 mark 2014-10-29 21:48:20 UTC
Ok, I was looking at the difference between 8.4 and 9.2 just for the zfsboot directory. There is more to gptzfsboot than just zfsboot, and I didn't look at all the contributions separately.

$ svn diff https://svn0.eu.freebsd.org/base/releng/8.4/sys/boot/i386/zfsboot https://svn0.eu.freebsd.org/base/releng/9.2/sys/boot/i386/zfsboot 
Index: zfsboot.c
===================================================================
--- zfsboot.c	(.../8.4/sys/boot/i386/zfsboot)	(revision 273837)
+++ zfsboot.c	(.../9.2/sys/boot/i386/zfsboot)	(revision 273837)
@@ -54,7 +54,7 @@
 #define NOPT		14
 #define NDEV		3
 
-#define BIOS_NUMDRIVES		0x475
+#define BIOS_NUMDRIVES	0x475
 #define DRV_HARD	0x80
 #define DRV_MASK	0x7f
 
@@ -489,7 +489,12 @@
      * will find any other available pools and it may fill in missing
      * vdevs for the boot pool.
      */
-    for (i = 0; i < *(unsigned char *)PTOV(BIOS_NUMDRIVES); i++) {
+#ifndef VIRTUALBOX
+    for (i = 0; i < *(unsigned char *)PTOV(BIOS_NUMDRIVES); i++)
+#else
+    for (i = 0; i < MAXBDDEV; i++)
+#endif
+    {
 	if ((i | DRV_HARD) == *(uint8_t *)PTOV(ARGS))
 	    continue;
 
@@ -780,8 +785,10 @@
 	    }
 	    ioctrl = OPT_CHECK(RBX_DUAL) ? (IO_SERIAL|IO_KEYBOARD) :
 		     OPT_CHECK(RBX_SERIAL) ? IO_SERIAL : IO_KEYBOARD;
-	    if (ioctrl & IO_SERIAL)
-	        sio_init(115200 / comspeed);
+	    if (ioctrl & IO_SERIAL) {
+	        if (sio_init(115200 / comspeed) != 0)
+		    ioctrl &= ~IO_SERIAL;
+	    }
 	} if (c == '?') {
 	    dnode_phys_t dn;
 
Index: Makefile
===================================================================
--- Makefile	(.../8.4/sys/boot/i386/zfsboot)	(revision 273837)
+++ Makefile	(.../9.2/sys/boot/i386/zfsboot)	(revision 273837)
@@ -16,7 +16,6 @@
 
 CFLAGS=	-DBOOTPROG=\"zfsboot\" \
 	-O1 \
-	-mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3 \
 	-DBOOT2 \
 	-DSIOPRT=${BOOT_COMCONSOLE_PORT} \
 	-DSIOFMT=${B2SIOFMT} \
@@ -78,7 +77,7 @@
 
 SRCS=	zfsboot.c
 
-.if ${MACHINE_ARCH} == "amd64"
+.if ${MACHINE_CPUARCH} == "amd64"
 beforedepend zfsboot.o: machine
 CLEANFILES+=	machine
 machine:
@@ -86,3 +85,7 @@
 .endif
 
 .include <bsd.prog.mk>
+
+# XXX: clang integrated-as doesn't grok .codeNN directives yet
+CFLAGS.zfsldr.S=	${CLANG_NO_IAS}
+CFLAGS+=		${CFLAGS.${.IMPSRC:T}}
Comment 8 Graham Perrin freebsd_committer freebsd_triage 2022-04-09 06:55:23 UTC
With OpenZFS and FreeBSD 13.0-RELEASE or greater, is this still an issue?