Bug 270805 - loader.efi: crashes with USB device attached
Summary: loader.efi: crashes with USB device attached
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 13.2-RELEASE
Hardware: arm64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: crash, loader
Depends on:
Blocks:
 
Reported: 2023-04-12 21:17 UTC by Robert Clausecker
Modified: 2023-04-28 22:25 UTC
CC List: 4 users

See Also:


Description Robert Clausecker freebsd_committer freebsd_triage 2023-04-12 21:17:38 UTC
I am running FreeBSD 13.2 on my Windows Dev Kit 2023.  I had to apply D37765 and cherry-pick D38031 to get this to work.  hw.pac.enable=0 is needed in /boot/loader.conf to make the kernel boot.  I have installed FreeBSD on ZFS on the internal NVMe SSD.

Now I have noticed the following problem: if I have any USB storage device attached during boot, the loader crashes with a synchronous exception after displaying the beastie menu.  I have unfortunately not managed to capture the address shown.

With no USB storage device attached, the device boots just fine.

Due to the device now being in a data center I will not be able to do any boot loader testing.  I can however provide you with any other information you might need and give you copies of the binaries involved.
Comment 1 Warner Losh freebsd_committer freebsd_triage 2023-04-13 18:50:52 UTC
Is it any USB device? Or one particular one? I boot 20 times a day with USB storage devices attached to my UEFI FreeBSD test machine sometimes... So there's something different about your setup we need to understand.
Comment 2 Robert Clausecker freebsd_committer freebsd_triage 2023-04-13 21:22:41 UTC
(In reply to Warner Losh from comment #1)

It happened with two USB sticks and with a SATA M.2 SSD in an M.2-SATA to USB adapter.  USB keyboard worked fine.

The SATA M.2 SSD had no partition table (gpart destroy had been run on its GPT partition table); the USB sticks had GPT partition tables and various file systems (FAT32 and UFS, I think).
Comment 3 Mark Millard 2023-04-22 01:14:21 UTC
Confirmed with main FreeBSD USBC boot media attached. This is
media I use to boot 3 types of Cortex-A72 systems, 2 by UEFI/ACPI
(HoneyComb, MACCHIATObin Double Shot), and 1 by U-Boot (RPi4Bs).
Also: 2 Cortex-A53 U-Boot based systems, a RPi3B and RPi2B
v1.2 . Also: ZFS root media. The media has main [so: 14] and
has the commits for D37765 and D38031 . I had put in place
hw.pac.enable=0 as well.

Result on the Windows Dev Kit 2023 was a quick:

Synchronous Exception 0x0000000092F922FC


The sequence for the first Dev Kit Power On was:

1st: Power on with UEFI button and set the UEFI to use no secure boot keys.
     Set USB as the first context to try to boot from/via.

2nd: Try to boot with the FreeBSD media attached to USBC.
     Result: the exception.

(So Windows 11 Pro had never been started yet and the internal
media was undisturbed.)

Everything I've tried to boot that USBC media has gotten the same
result. I've not found any UEFI settings that make a difference.

Later I'll try some blank USBC media and possibly other variations
to see if something more specific about the media content leads to
the specific failure report vs. possibly other failures.

I'll note that I've no plan to remove the Windows Pro 11 from the
internal media. I want to be able to swap external media and boot, 
such as switching between ZFS and UFS based systems. I'll also
probably try FreeBSD in Hyper-V at some point.
Comment 4 Mark Millard 2023-04-22 02:09:11 UTC
(In reply to Mark Millard from comment #3)

I took an as-factory-shipped example of the type of USBC
media (but 1 TiByte instead of 2 TiByte) and tried to
boot with the media attached (but no OS or such present
on the media).

No Synchronous Exception

Eventually it booted to the internal Windows 11 Pro
(despite UEFI having the internal media unchecked)

So it appears that the Synchronous Exception that
I've gotten on the original media is a response to
something on/from that media. Merely having
non-boot media connected did not have a problem.
Comment 5 Mark Millard 2023-04-22 02:27:29 UTC
(In reply to Mark Millard from comment #4)

I booted the media in question on another machine
and did:

# mv /boot/efi/EFI /boot/efi/EFI-disabled

Then I tried booting the Dev Kit machine with
the media connected:

No Synchronous Exception

Eventually it booted to the internal Windows 11 Pro

So it appears that the FreeBSD boot loader is
involved in the problem.


Side note: I've now also tried a USB3-A style port
instead of USB3-C. Both types of ports get the
issue. (Given some Microsoft wording I was not sure
that USB3-A ports were involved in potential booting.
They are.)

FYI:

# ls -Tld /boot/efi/EFI/*/*
-r-xr-xr-x  1 root  wheel  865292 Mar 15 21:30:46 2023 /boot/efi/EFI/BOOT/bootaa64.efi
-rwxr-xr-x  1 root  wheel  865292 Mar 15 21:30:46 2023 /boot/efi/EFI/FREEBSD/loader.efi

They still match what is on my normal environment that
still predates the openzfs import disaster.
Comment 6 Mark Millard 2023-04-22 05:35:41 UTC
(In reply to Mark Millard from comment #5)

Well, "has the commits for D37765 and D38031":
built/installed only.
I needed to update the .efi files on the msdosfs. (Done now.)
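
A minimal sketch of a check for this kind of mismatch between the installed loader and the copies on the msdosfs. It assumes /boot/loader.efi is the freshly installed loader and that the EFI system partition is mounted at /boot/efi, as in the listing in comment 5; loader_status is a hypothetical helper name, not an existing tool.

```sh
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): compare a reference loader
# against one copy on the EFI system partition, byte for byte.
loader_status() {
    if cmp -s "$1" "$2"; then
        echo "up to date"
    else
        echo "stale"
    fi
}

# Assumed paths: adjust if the ESP is mounted somewhere else.
ref=/boot/loader.efi
for copy in /boot/efi/EFI/BOOT/bootaa64.efi /boot/efi/EFI/FREEBSD/loader.efi; do
    # Only report on files that are actually present and readable.
    if [ -r "$ref" ] && [ -r "$copy" ]; then
        echo "$copy: $(loader_status "$ref" "$copy")"
    fi
done
```

A copy reported as stale would then be the candidate for being overwritten with the reference file.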

That gets things to where the FreeBSD kernel activity
gets to the point of the root file system mount, no more
exceptions.

But it ends up complaining that it can not find the pool
label for 'zroot'. (Lots of times.) So it fails to mount
the root file system.

Earlier there are a few ACPI errors/warnings I see when I
scroll back (only on screen, no serial console):

ACPI Error: AE_NOT_FOUND, While resolving a named reference package element -\_SB_.UBF0.PRT0 (20221020/dspkginit-605)
ACPI Error: AE_NOT_FOUND, While resolving a named reference package element -\_SB_.UBF0.PRT1 (20221020/dspkginit-605)

ACPI Warning: \_SB.GPU._CLS: Return Package is too small - found 1 element, expected 3 (20221020/nsprepkg-511)

can't fetch resource for \_SB|.ADC1 - AE_AML_INVALID_RESOURCE_TYPE
Comment 7 Mark Millard 2023-04-22 08:28:10 UTC
(In reply to Mark Millard from comment #6)

I found the distinction that controls failure vs.
success in booting via the USB3 ports:

USB3-C ugen0.5: <GenesysLogic USB3.2 Hub> at usbus0 ports:
ZFS and UFS boots fail.

USB3-A ugen0.1: <Generic XHCI root HUB>   at usbus0 ports:
ZFS and UFS boots work.

Looks like the FreeBSD kernel does not handle USB3.2
(but the UEFI/ACPI does for the FreeBSD loader).

This may make the Windows Dev Kit 2023 a useful context
for development work on handling more modern USB3.*'s.

I'll note that https://learn.microsoft.com/en-us/windows/arm/dev-kit/
reports:

QUOTE
When connecting an external keyboard or mouse, use the USB-A ports,
not USB-C. Using USB-C to connect a keyboard or mouse will only work
intermittently.
END QUOTE

(It is unclear if that is a Windows specific issue, UEFI issue,
both, or more.)


For reference for the UFS USB3-C boot failures, the messages
are:

Mounting from ufs:/dev/gpt/CA72USBufs failed with error 22;
retrying for 10 more seconds
Mounting from ufs:/dev/gpt/CA72USBufs failed with error 22;
invalid fstype.
Comment 8 Mark Millard 2023-04-22 09:07:33 UTC
Just FYI: A problem that I've noticed is:

# date
Wed Dec 31 16:50:41 PST 1969

despite /etc/rc.conf having:

ntpd_enable="YES"
ntpd_sync_on_start="YES"

and this working when booting other machines.
Comment 9 Mark Millard 2023-04-22 09:32:05 UTC
Another FYI of an oddity (during a buildworld):

# sysctl -a | grep "temp.*[0-9]C$"
hw.acpi.thermal.tz31.temperature: -273.1C
hw.acpi.thermal.tz30.temperature: -273.1C
hw.acpi.thermal.tz29.temperature: -273.1C
hw.acpi.thermal.tz28.temperature: -273.1C
hw.acpi.thermal.tz27.temperature: -273.1C
hw.acpi.thermal.tz26.temperature: -273.1C
hw.acpi.thermal.tz25.temperature: -273.1C
hw.acpi.thermal.tz24.temperature: -273.1C
hw.acpi.thermal.tz23.temperature: -273.1C
hw.acpi.thermal.tz22.temperature: -273.1C
hw.acpi.thermal.tz21.temperature: -273.1C
hw.acpi.thermal.tz20.temperature: -273.1C
hw.acpi.thermal.tz19.temperature: -273.1C
hw.acpi.thermal.tz18.temperature: -273.1C
hw.acpi.thermal.tz17.temperature: -273.1C
hw.acpi.thermal.tz16.temperature: -273.1C
hw.acpi.thermal.tz15.temperature: -273.1C
hw.acpi.thermal.tz14.temperature: -273.1C
hw.acpi.thermal.tz13.temperature: -273.1C
hw.acpi.thermal.tz12.temperature: -273.1C
hw.acpi.thermal.tz11.temperature: -273.1C
hw.acpi.thermal.tz10.temperature: -273.1C
hw.acpi.thermal.tz9.temperature: -273.1C
hw.acpi.thermal.tz8.temperature: -273.1C
hw.acpi.thermal.tz7.temperature: -273.1C
hw.acpi.thermal.tz6.temperature: -273.1C
hw.acpi.thermal.tz5.temperature: -273.1C
hw.acpi.thermal.tz4.temperature: -273.1C
hw.acpi.thermal.tz3.temperature: -273.1C
hw.acpi.thermal.tz2.temperature: -273.1C
hw.acpi.thermal.tz1.temperature: -273.1C
hw.acpi.thermal.tz0.temperature: -273.1C
Comment 10 Robert Clausecker freebsd_committer freebsd_triage 2023-04-22 10:17:19 UTC
(In reply to Mark Millard from comment #9)

The thermal zones for some reason do not have registers to read the temperature from.  Hence some internal interface returns -1, which is converted into slightly below absolute zero.
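
As a rough illustration of the conversion involved: ACPI thermal zones report temperatures in tenths of a kelvin, so a placeholder reading of 0 or -1 lands at or just below absolute zero once converted to Celsius. The sketch below uses a hypothetical helper name and plain floating-point arithmetic; it does not reproduce whatever rounding the kernel or sysctl applies to arrive at the -273.1C shown in comment 9.

```sh
#!/bin/sh
# Hypothetical helper: convert a raw ACPI thermal reading (tenths of a
# kelvin) to degrees Celsius with two decimals.
decikelvin_to_celsius() {
    awk -v raw="$1" 'BEGIN { printf "%.2f\n", raw / 10 - 273.15 }'
}

decikelvin_to_celsius 0     # an all-zero reading
decikelvin_to_celsius -1    # an error value of -1
```

Both calls print a value within about a tenth of a degree of -273.15, which is the kind of reading to expect from a zone with no usable sensor behind it.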

Your information about USB 2 vs USB 3 is interesting.  I did my previous testing with a SATA HDD attached to a USB to SATA adapter, which should be using USB 3.  Will test with that one again next week.
Comment 11 Robert Clausecker freebsd_committer freebsd_triage 2023-04-22 10:23:59 UTC
(In reply to Robert Clausecker from comment #10)

Previous testing, as in, before I finally set up the machine.  After setting it up, I tried attaching a variety of other USB disks that may have all supported a later protocol level.

However, in one of my previous attempts (http://fuz.su/~fuz/files/volterra-dmesg-7.log), you can clearly see that the boot disk is attached via the integrated USB 3.2 hub (however, it was a USB A port):

ugen0.4: <GenesysLogic USB3.2 Hub> at usbus0
uhub2 on uhub0
uhub2: <GenesysLogic USB3.2 Hub, class 9/0, rev 3.20/61.24, addr 3> on usbus0
uhub2: 4 ports with 3 removable, self powered
(...)
ugen0.6: <ASMedia AS2115> at usbus0
umass0 on uhub2
umass0: <ASMedia AS2115, class 0/0, rev 3.00/0.01, addr 5> on usbus0
umass0:  SCSI over Bulk-Only; quirks = 0x0100
umass0:1:0: Attached to scbus1
da0 at umass-sim0 bus 0 scbus1 target 0 lun 0
da0: <ASMT 2115 0> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number 00000000000000000000
da0: 400.000MB/s transfers
da0: 152627MB (312581808 512 byte sectors)
da0: quirks=0x2<NO_6_BYTE>
da0: Delete methods: <NONE(*),ZERO>
GEOM: new disk da0

The 1969 date is because FreeBSD does not detect an RTC clock.  I'm not sure if the machine has one; there's no battery inside, so how would it keep the date?  See dmesg log:

Warning: no time-of-day clock registered, system time will not be set accurately
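
A small worked check of that reading, assuming the clock simply started at or near zero: the Unix epoch is 1970-01-01 00:00:00 UTC, which is 16:00:00 PST on 1969-12-31 (PST being UTC-8). The helper name below is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: seconds since the Unix epoch for a wall-clock time
# on 1969-12-31 PST. Pass plain integers without leading zeros, since sh
# arithmetic treats a leading 0 as octal.
secs_after_epoch() {
    echo $(( ($1 - 16) * 3600 + $2 * 60 + $3 ))
}

secs_after_epoch 16 50 41   # the date output in comment 8
secs_after_epoch 16 0 24    # the "Dec 31 16:00:24" syslog timestamps
```

This prints 3041 and 24: the date output was taken about 51 minutes after a clock that began at zero, and the syslog lines 24 seconds after it.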
Comment 12 Mark Millard 2023-04-22 12:54:03 UTC
(In reply to Robert Clausecker from comment #10)

I did not write anything about USB2, only USB3.? .

The issue is USB3.0 vs USB3.2 for the hardware in the
Windows Dev Kit 2023 (WDK23) hubs/ports and its handling by
the FreeBSD kernel.

I used the exact same drive connected to different
places on the WDK23.
Comment 13 Robert Clausecker freebsd_committer freebsd_triage 2023-04-22 12:58:34 UTC
(In reply to Mark Millard from comment #12)

Weird.  I only ever tried to connect to the USB A ports.  Maybe the USB A ports are not all the same?
Comment 14 Mark Millard 2023-04-22 13:16:38 UTC
(In reply to Robert Clausecker from comment #11)

All my ZFS testing was with the same drive in different ports.

All my UFS testing was with the same drive in different ports.

(ZFS drive vs. UFS drive: same type but distinct instances.)

The 2 drives are USB3.2 capable but are compatible/capable
with USB3.0 (and with USB2). In this context, the WDK23
internal hubs and ports are a mix of USB3.0 and USB3.2 .

As I understand it, even for a USB3.0 device, when attached
to a USB 3.2 hub/port the kernel has somewhat different
activity to do. The hub/port is not fully transparent of
itself.

(Maybe you were referencing my keyboard/mouse note, where
I did not reference USB2 explicitly. I do not expect that
any keyboards/mice issues are relevant to the storage media
issues.)

As for the time:

RPi4B's do not have an RTC but the ntpd startup I use
deals with setting up the time anyway. That did not happen
here. I'm unsure why. I ended up manually setting the date
in order to allow my buildworld buildkernel test.

(Again, I sometimes boot the RPi4B's with the same drives
that I used for the Windows Dev Kit 2023 testing.)


As for temperature: If what you report is true, it is odd
that the UEFI/ACPI implementation supplies definitions for
non-existing sensors.
Comment 15 Mark Millard 2023-04-22 13:52:17 UTC
(In reply to Robert Clausecker from comment #13)

Looking again at the log for the successful boot that I was
referencing, it is not as I said (from grep for usb/uhub
references):

Dec 31 16:00:24 CA72_4c8G_ZFS kernel: uhub0 on usbus0
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: uhub0: <Generic XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: uhub0: 6 ports with 6 removable, self powered
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: uhub2 on uhub0
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: uhub2: <GenesysLogic USB3.2 Hub, class 9/0, rev 3.20/61.24, addr 4> on usbus0
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: uhub2: 4 ports with 3 removable, self powered
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: umass0 on uhub2
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: umass0: <Samsung PSSD T7 Touch, class 0/0, rev 3.20/1.00, addr 6> on usbus0

(That was a USB-A connection.)

(Unfortunately, no logs from the failing contexts. Tomorrow
I can likely scroll back on screen and find and record where
it reports umass0 as being when I use USB-C, tracing back to
where the XHCI root HUB is.)

But your log's subsequence is the same:

uhub0 on usbus0
uhub0: <Generic XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
usbus0: 5.0Gbps Super Speed USB v3.0
uhub0: 6 ports with 6 removable, self powered
uhub2: <GenesysLogic USB3.2 Hub, class 9/0, rev 3.20/61.24, addr 3> on usbus0
uhub2: 4 ports with 3 removable, self powered
uhub2 on uhub0
uhub2: <GenesysLogic USB3.2 Hub, class 9/0, rev 3.20/61.24, addr 3> on usbus0
umass0 on uhub2
umass0: <ASMedia AS2115, class 0/0, rev 3.00/0.01, addr 5> on usbus0
Comment 16 Mark Millard 2023-04-22 13:57:05 UTC
(In reply to Robert Clausecker from comment #13)

Looking again at the log for the successful boot that I was
referencing, it is not as I said (from grep for usb/uhub
references) and what varies between your log and mine is
something else: Mine was a USB3.2 device on the USB3.2
hub but yours was a USB3.0 device on the USB3.2 hub.

My backtrace from umass0 to Generic XHCI root HUB:

Dec 31 16:00:24 CA72_4c8G_ZFS kernel: uhub0 on usbus0
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: uhub0: <Generic XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: uhub0: 6 ports with 6 removable, self powered
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: uhub2 on uhub0
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: uhub2: <GenesysLogic USB3.2 Hub, class 9/0, rev 3.20/61.24, addr 4> on usbus0
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: uhub2: 4 ports with 3 removable, self powered
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: umass0 on uhub2
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: umass0: <Samsung PSSD T7 Touch, class 0/0, rev 3.20/1.00, addr 6> on usbus0
(Note that last "rev 3.20".)

Your backtrace from umass0 to Generic XHCI root HUB:

uhub0 on usbus0
uhub0: <Generic XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
usbus0: 5.0Gbps Super Speed USB v3.0
uhub0: 6 ports with 6 removable, self powered
uhub2: <GenesysLogic USB3.2 Hub, class 9/0, rev 3.20/61.24, addr 3> on usbus0
uhub2: 4 ports with 3 removable, self powered
uhub2 on uhub0
uhub2: <GenesysLogic USB3.2 Hub, class 9/0, rev 3.20/61.24, addr 3> on usbus0
umass0 on uhub2
umass0: <ASMedia AS2115, class 0/0, rev 3.00/0.01, addr 5> on usbus0
(Note that last "rev 3.00".)
Comment 17 Mark Millard 2023-04-22 13:58:58 UTC
(In reply to Mark Millard from comment #16)

FYI:
The 2 "backtraces" are presented in forward-time order, so
backtracing is reading each bottom-to-top.
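
The same bottom-to-top reading can be sketched mechanically by following the bare "<child> on <parent>" attach lines. This is only an illustration: usb_chain is a hypothetical helper, and it assumes the attach lines have exactly the three-word form seen in the logs above, optionally behind a syslog prefix ending in "kernel: ".

```sh
#!/bin/sh
# Hypothetical helper: read dmesg-style text on stdin and print the attach
# chain from the named device up to its bus.
usb_chain() {
    awk -v dev="$1" '
        # Drop an optional syslog prefix such as "... kernel: ".
        { sub(/^.*kernel: /, "") }
        # Keep only bare attach lines, e.g. "umass0 on uhub2".
        NF == 3 && $2 == "on" { parent[$1] = $3 }
        END {
            chain = dev
            # Bounded walk so a malformed log cannot loop forever.
            for (n = 0; n < 16 && (dev in parent); n++) {
                dev = parent[dev]
                chain = chain " -> " dev
            }
            print chain
        }'
}

# Lines copied from the log in comment 16.
usb_chain umass0 <<'EOF'
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: uhub0 on usbus0
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: uhub0: <Generic XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: uhub2 on uhub0
Dec 31 16:00:24 CA72_4c8G_ZFS kernel: umass0 on uhub2
EOF
```

On a running system the input could instead come from dmesg piped into usb_chain umass0.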
Comment 18 Mark Millard 2023-04-22 14:33:36 UTC
(In reply to Mark Millard from comment #17)

I set up the USB-C connection context again and am looking on
screen at the scroll back for the failure:

No "umass*" ever shows up for my failing context (USB-C in
use). (But the FreeBSD loader loaded the kernel from the
drive just fine via UEFI's drive I/O support.)

This does not match your failure's log. So there may be
2 distinct problems for our 2 failures.

After sleeping, I may try to set up a USB3.0/USB-A context
to see if I can replicate your failure. Probably using UFS.

By the way, in the UEFI, what is the boot order you have
it using for finding a boot media? Did you disable any
of the options? Which? I moved USB to the top (first) and
disabled the others. (It still eventually boots Windows 11
Pro if no USB EFI loader is found.)
Comment 19 Robert Clausecker freebsd_committer freebsd_triage 2023-04-22 15:02:25 UTC
(In reply to Mark Millard from comment #18)

I tried various boot orders and none of them changed the result.  I believe that once the boot loader is successfully loaded by UEFI, the boot order ceases to be of importance.
Comment 20 Mark Millard 2023-04-22 23:17:21 UTC
(In reply to Robert Clausecker from comment #0)

Ultimately, using main [so: 14], I've not been able to
reproduce any "after displaying the beastie menu" crashes
based on USB storage having been connected during the
boot. USB3.2 and USB3.0 devices. I cover 3 of the 4
combinations relative to port types for my test context:

USB3.2 in USB-C port (no "umass0" or no "umass1" or . . .)
USB3.2 in USB-A port (works)
USB3.0 in USB-A port (works)

Unfortunately, I do not have a way to form a
USB3.0/USB-C combination.

The USB3.2 in USB-C port case has differing consequences
for boot media (no root mount or the like) vs. having the
same result as not plugging the drive into a port at
all: not detected.

(In my context, the FreeBSD boot media is always a
"umass*".)

I no longer maintain an environment for building stable/*
or releng/* variants. So it may be a main vs. releng
distinction compared to your results. I've not checked.

I'll also note that I do not have access to a variety of
media of the types listed. It could be some other distinction
is involved that happens to correlate with USB3.2 in my
context and that some other USB3.2 storage media would work.
I've no way to know from what I've available to test.

(Of course, a bunch of my comments ended up being the process
of figuring out my own operator error: All those reporting a
Synchronous Exception.)
Comment 21 Mark Millard 2023-04-22 23:29:25 UTC
(In reply to Robert Clausecker from comment #19)

Note:

If you get ahold of a main [so: 14] loader.efi copy that
has the 2 required commits, you could try substituting
that loader.efi content into your 13.2-RELEASE media and
see if you then end up with what I've reported.

I've not checked the latest snapshot for those commits but
at some point extracting a loader.efi from a main snapshot
would allow the experiment (before stable or release had
such).
Comment 22 Robert Clausecker freebsd_committer freebsd_triage 2023-04-22 23:47:26 UTC
(In reply to Mark Millard from comment #21)

Unfortunately the device is colocated now and hard for me to access.  It's also busy 24/7 building ports.  I'll see if I can find an opportunity to perform these tests.
Comment 23 Mark Millard 2023-04-23 01:21:10 UTC
(In reply to Mark Millard from comment #20)

I've submitted:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=271012

against main's kernel for the failure context that
I ran into (since it does not match yours as far as
we can tell so far).
Comment 24 Mark Millard 2023-04-23 08:16:23 UTC
(In reply to Robert Clausecker from comment #22)

FYI:

I have tested one of the latest aarch64 snapshots,
with /boot/loader.conf adjusted, and it worked fine
for booting the Windows Dev Kit 2023 as the snapshot's
first boot.

So the loader.efi and kernel involved look to be
appropriate.

The media was one of the USB3.0/USB-A devices.
Comment 25 Mark Millard 2023-04-27 00:30:22 UTC
(In reply to Mark Millard from comment #20)

In a few days I should have an adapter to connect
the USB3.0 SSDs that have a USB-A connector to
USB-C ports, such as on the Windows Dev Kit 2023.

So I should then be able to test USB3.0 devices
on the USB-C ports instead of only USB3.2 devices
on those ports, including for being present during
booting.

(Still only one type of USB3.0 SSD device, not a
variety. But one type is more than zero types.)

Of course, my test would not be likely to duplicate
the partitioning or content of the drives that got
the crashes. Duplication of some aspects may be
required to see the problem and if my tests do not
get the crash, such would be suggested as a
possibility.
Comment 26 Mark Millard 2023-04-28 22:25:24 UTC
(In reply to Mark Millard from comment #25)

Interestingly, using the adaptor to USB-C for plugging in
media after booting FreeBSD, I get different results on
different systems (the only FreeBSD USB-C contexts that I've
access to):

ThreadRipper 1950X:   media is detected.
Windows Dev Kit 2023: same media is not detected.

The ThreadRipper is a USB 3.1 context for the USB-C connector,
not a USB 3.2 context.

The WDK23 has its note about keyboards and mice via its USB-C
connector which may indicate something relevant.

The media here is a USB3.0 SSD.



As for having the USB-C connection present during boot loader
activity: that did not cause the loader any problems.

The only way that I've ever found to have a synchronous
exception during loader activity is from having plugged in
FreeBSD boot media with too old of a UEFI loader --and for
the UEFI to also have picked that media to get the UEFI
loader from. Then that loader leads to the synchronous
exception.

The UEFI does not give a way to explicitly pick which USB
device to boot from when more than one "bootable" USB media
is present. Maybe the port scan is in a fixed order or
some such. I've not tested for such.

Overall: I'm unable to reproduce the problem that was
reported.