Bug 264021 - loader: Fails to boot on arm64 after base 3a9a9c0ca44e: failed to allocate staging area: 9 (EFI_OUT_OF_RESOURCES)
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: arm64 Any
Importance: --- Affects Only Me
Assignee: Andrew Turner
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-05-16 16:34 UTC by Martin Filla
Modified: 2022-06-17 23:34 UTC
CC List: 10 users

See Also:
koobs: mfc-stable13?
koobs: mfc-stable12?


Description Martin Filla 2022-05-16 16:34:35 UTC
Hi,
I built the current source code and booting fails with "failed to allocate staging area: 9".
Status 9 is EFI_OUT_OF_RESOURCES. The board is a NanoPC-T4.

Loading kernel...
/boot/kernel/kernel text=0x2a8 text=0x82c250 text=0x24bb6c data=0x1b9ea8 data=0x0+0x34e000 0x8+0x132d08+0x8+0x15aaa2-
Loading configured modules...
can't find '/boot/entropy'
can't find '/etc/hostid'
Using DTB provided by EFI at 0x80e8000.
EFI framebuffer information:
addr, size     0x0, 0x0
dimensions     0 x 0
stride         0
masks          0x00000000, 0x00000000, 0x00000000, 0x00000000
failed to allocate staging area: 9
failed to allocate staging area: 9
failed to allocate staging area: 9
failed to allocate staging area: 9
failed to allocate staging area: 9
failed to allocate staging area: 9
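As an illustration of how the "9" in that message relates to EFI_OUT_OF_RESOURCES: UEFI error statuses carry a high error bit plus a small code, and the printed number is the code with that bit masked off. The sketch below assumes a 64-bit platform; every `sketch_`/`SKETCH_` name is hypothetical and local to this example rather than taken from the loader's headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical local definitions modelled on the UEFI status layout (64-bit). */
typedef uint64_t sketch_status_t;

#define SKETCH_ERROR_BIT		((sketch_status_t)1 << 63)
#define SKETCH_EFIERR(n)		(SKETCH_ERROR_BIT | (sketch_status_t)(n))
#define SKETCH_EFI_BUFFER_TOO_SMALL	SKETCH_EFIERR(5)
#define SKETCH_EFI_OUT_OF_RESOURCES	SKETCH_EFIERR(9)

/* Strip the error bit, leaving the small number that ends up in the log. */
unsigned long
sketch_error_code(sketch_status_t status)
{
	return ((unsigned long)(status & ~SKETCH_ERROR_BIT));
}

/* Map a few well-known codes to their specification names. */
const char *
sketch_error_name(sketch_status_t status)
{
	if ((status & SKETCH_ERROR_BIT) == 0)
		return ("not an error");
	switch (sketch_error_code(status)) {
	case 5:
		return ("EFI_BUFFER_TOO_SMALL");
	case 9:
		return ("EFI_OUT_OF_RESOURCES");
	default:
		return ("other error");
	}
}

/* Format a message shaped like the one in the log; returns snprintf's result. */
int
sketch_format_failure(char *buf, size_t len, sketch_status_t status)
{
	assert(buf != NULL);
	return (snprintf(buf, len, "failed to allocate staging area: %lu",
	    sketch_error_code(status)));
}
```

Calling `sketch_format_failure()` with `SKETCH_EFI_OUT_OF_RESOURCES` produces the same text as the repeated line in the log above.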
Comment 1 Graham Perrin freebsd_committer freebsd_triage 2022-05-16 18:42:13 UTC
Is there an ESP and if so, how recent is the loader in that partition?
Comment 2 Martin Filla 2022-05-16 19:17:28 UTC
(In reply to Graham Perrin from comment #1)
I built the image with this script: https://github.com/Martinfx/BuildBoard/blob/main/create_image_nanopc_t4.sh
Comment 3 Danilo Egea Gondolfo freebsd_committer freebsd_triage 2022-05-19 20:15:12 UTC
Same problem here. It looks like it was introduced by the llvm stack update.

I narrowed it down to this commit 3a9a9c0ca44ec535dcf73fe8462bee458e54814b.

The problem doesn't exist at cb2ae6163174b90e999326ecec3699ee093a5d43, the last commit before the llvm merge from vendor.


In my case I'm using a RockPro64 and boot via network:

Consoles: EFI console  
    Reading loader env vars from /efi/freebsd/loader.env
FreeBSD/arm64 EFI loader, Revision 1.1
(Thu May 19 20:51:18 IST 2022 root@capeta)

   Command line arguments: nfsroot=172.16.0.1:/ztank/rockpro64
   Image base: 0xf0dd2000
   EFI version: 2.80
   EFI Firmware: Das U-Boot (rev 8225.1792)
   Console: efi (0x1000)
   Load Path: /boot\loader_lua.efi
   Load Device: /VenHw(e61d73b9-a384-4acc-aeab-82e828f3628b)/MAC(867402fbbe41,1)
Comment 4 Marek Zarychta 2022-05-20 08:15:18 UTC
The culprit is bootaa64.efi AKA loader_lua.efi. Using the older version from March 24 2022 solved the issue in my case. 
It was FreeBSD 14.0-CURRENT #3 main-n255718-f2ab9160844: Fri May 20 08:59:20 CEST 2022  on PINE64LTS that suffered from this issue.
Comment 5 Martin Filla 2022-05-20 14:58:18 UTC
(In reply to Danilo Egea Gondolfo from comment #3)
Commit cb2ae6163174b90e999326ecec3699ee093a5d43 works for me.
Comment 6 Marek Zarychta 2022-05-20 15:09:15 UTC
The file bootaa64.efi has shrunk a lot. The version from March 24 is 1272140 bytes long; the new version is only 1182412 bytes long. I will try to investigate further if conditions allow.
Comment 7 Andrew Turner freebsd_committer freebsd_triage 2022-05-20 20:07:40 UTC
It looks like the loop in bi_load_efi_data in bootinfo.c is too smart for clang: it gets confused and thinks efihdr and mm don't get initialised. This causes it to remove all the code after the getenv, meaning we return from this getenv call into efi_copy_init.

efi_copy_init then enters an infinite loop, allocating all memory until it runs out and complains. It is still in the loop at that point, so it keeps trying, and failing, to allocate more memory.

I have a local fix I'll push for review soon, but a workaround for now seems to be making sure efihdr and mm are initialised to NULL before the comment starting "Matthew Garrett has observed ..."
Comment 8 Jessica Clarke freebsd_committer freebsd_triage 2022-05-20 21:04:16 UTC
I doubt efihdr is the problem. It's likely that the call to BS->GetMemoryMap the first time round the loop, which is guaranteed to be executed, reads an uninitialised mm, and thus we have trivially provably guaranteed UB ("The value of an object with automatic storage duration is used while it is indeterminate"). What value it takes doesn't matter as the first time round the loop we use sz = 0 so, unless the memory map has 0 entries, it's guaranteed to fit, but it must be initialised to something determinate.

Minimal-ish reproducer: https://godbolt.org/z/KTvd73osd
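The linked reproducer is not copied here. Separately, as a rough sketch of the pattern under discussion, the code below models a retry loop whose first call deliberately passes a zero size, with the buffer pointer set to NULL beforehand so that the value passed is determinate. The mock firmware call, the sizes, and all `mock_`/`sketch_` names are assumptions made for illustration; this is not the code from bootinfo.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the firmware interface; not the real UEFI API. */
enum mock_status { MOCK_SUCCESS = 0, MOCK_BUFFER_TOO_SMALL = 5 };

#define MOCK_MAP_BYTES	480	/* size of the map the mock firmware holds */
#define MOCK_DESC_BYTES	48	/* assumed size of one descriptor */

/*
 * Mock of a GetMemoryMap-style call: when the buffer is too small it only
 * reports the size needed and never dereferences map.
 */
static enum mock_status
mock_get_memory_map(size_t *sz, void *map)
{
	if (*sz < MOCK_MAP_BYTES) {
		*sz = MOCK_MAP_BYTES;
		return (MOCK_BUFFER_TOO_SMALL);
	}
	((unsigned char *)map)[0] = 0;	/* pretend to fill in the map */
	*sz = MOCK_MAP_BYTES;
	return (MOCK_SUCCESS);
}

/*
 * Fetch the map, growing the buffer as requested. Returns the number of
 * firmware calls made, or -1 on failure. The caller frees *out.
 */
int
sketch_load_map(void **out, size_t *outsz)
{
	size_t sz = 0;
	void *mm = NULL;	/* determinate before the first call */
	int calls = 0;

	assert(out != NULL && outsz != NULL);
	for (;;) {
		enum mock_status status;

		calls++;
		status = mock_get_memory_map(&sz, mm);
		if (status == MOCK_SUCCESS)
			break;
		if (status != MOCK_BUFFER_TOO_SMALL) {
			free(mm);
			return (-1);
		}
		free(mm);			/* free(NULL) is a no-op */
		sz += 10 * MOCK_DESC_BYTES;	/* slack for a growing map */
		mm = malloc(sz);
		if (mm == NULL)
			return (-1);
	}
	*out = mm;
	*outsz = sz;
	return (calls);
}
```

With the mock above the first call only learns the required size and the second call succeeds. Leaving `mm` without an initialiser would make the first call read an indeterminate value, which is the undefined behaviour described in this comment.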
Comment 9 Mark Millard 2022-05-21 01:09:44 UTC
Just FYI: My report at

https://lists.freebsd.org/archives/freebsd-arm/2022-May/001354.html

possibly is the same sort of issue. Both aarch64 and armv7 systems
had the system-llvm14 based build of loader.efi crash very early
in the boot. I reverted just the loader.efi copies that predated
my progressing to the system llvm14 basis (so, back to late April)
and that made things work.

But amd64's update via the same source tree contents did not have
problems using its llvm14 based loader.efi. (No other FreeBSD
platforms around to try.) So I've only evidence for armv7 and
aarch64 problems.

I'll note that the aarch64 context uses an EDK2 based UEFI/ACPI
and the armv7 context uses a U-Boot 2022.04 based UEFI (not ACPI).
Both have been in use for some time before this upgrade activity
and were not changed.

My installed contexts are from non-debug buildworld buildkernel
based builds, despite booting main [so: 14]. I could install
debug kernels and see what they report if I also put back the
loader.efi copies built via llvm14. Right now that does not
look to be likely to be useful.

For reference for the failure contexts:

aarch64: MACCHIATObin Double Shot
armv7:   Orange Pi+ 2ed

I have access to more aarch64 contexts and a RPi2 v1.1 (armv7).
But I avoided upgrading the loader.efi copies on any boot media
for either type after the first one of each type got the boot
failures --and I reverted on the media that showed the failure
for each type.
Comment 10 Mark Millard 2022-05-21 02:00:06 UTC
(In reply to Mark Millard from comment #9)

Looking at the information that the armv7 context
reported:

. . .
Hit [Enter] to boot immediately, or any other key for command prompt.
Booting [/boot/kernel/kernel]...               
Using DTB provided by EFI at 0x47edf000.
Kernel entry at 0xb2e00200...
Kernel args: (null)
undefined instruction
pc : [<b8dd34a4>]          lr : [<b8e3128c>]
reloc pc : [<44e3f4a4>]    lr : [<44e9d28c>]
sp : b9f6a328  ip : b69e1c00     fp : b9f6a368
r10: b9f6a374  r9 : 00000000     r8 : b8f1f11c
r7 : c0e03000  r6 : 00008000     r5 : b6981500  r4 : 00000000
r3 : 00000065  r2 : 00000076     r1 : b8f1b847  r0 : 00000000
Flags: nZCv  IRQs off  FIQs off  Mode SVC_32
Code: e08f0000 e1a0e00f ea01776e 00144492 (00146ddf) 
UEFI image [0xb8dd3000:0xb8f2632b] pc=0x4a4 '/efi\boot\bootarm.efi'
Resetting CPU ...

That Code sequence appears at:

Disassembly of section .text:

00000018 <efi_start>:
. . .
00000178 <bi_load>:
. . .
     4ac:       e08f0000        add     r0, pc, r0
     4b0:       e1a0e00f        mov     lr, pc
     4b4:       ea01776e        b       5e274 <getenv>
     4b8:       00144492        .word   0x00144492
. . .

It is also the only place in:

stand/efi/loader_lua/loader_lua.sym.full

with that code sequence. Showing some more context,
including 0x4a4:

. . .
     484:       eb003ae5        bl      f020 <file_addmetadata>
     488:       e59f3058        ldr     r3, [pc, #88]   ; 4e8 <bi_load+0x370>
     48c:       e1a00005        mov     r0, r5
     490:       e3a0100c        mov     r1, #12
     494:       e3a02004        mov     r2, #4
     498:       e79f3003        ldr     r3, [pc, r3]
     49c:       eb003adf        bl      f020 <file_addmetadata>
     4a0:       e1a00005        mov     r0, r5
     4a4:       eb01fec5        bl      7ffc0 <geli_export_key_metadata>
     4a8:       e59f003c        ldr     r0, [pc, #60]   ; 4ec <bi_load+0x374>
     4ac:       e08f0000        add     r0, pc, r0
     4b0:       e1a0e00f        mov     lr, pc
     4b4:       ea01776e        b       5e274 <getenv>
     4b8:       00144492        .word   0x00144492
     4bc:       00146ddf        .word   0x00146ddf
     4c0:       00146dd2        .word   0x00146dd2
     4c4:       0014402f        .word   0x0014402f
     4c8:       0014fb74        .word   0x0014fb74
     4cc:       0014fad8        .word   0x0014fad8
     4d0:       00147610        .word   0x00147610
     4d4:       0014a23b        .word   0x0014a23b
     4d8:       0014a367        .word   0x0014a367
     4dc:       0014b6c4        .word   0x0014b6c4
     4e0:       00144892        .word   0x00144892
     4e4:       001428f8        .word   0x001428f8
     4e8:       0014f8e4        .word   0x0014f8e4
     4ec:       001483ab        .word   0x001483ab
     4f0:       00144c81        .word   0x00144c81
     4f4:       001484fc        .word   0x001484fc

000004f8 <efi_copy_init>:
. . .

The bl to bi_load is in:

0000a2b4 <elf32_arm_exec>:
    a2b4:       e92d4830        push    {r4, r5, fp, lr}
    a2b8:       e28db008        add     fp, sp, #8
    a2bc:       e24dd008        sub     sp, sp, #8
    a2c0:       e3a01002        mov     r1, #2
    a2c4:       e1a05000        mov     r5, r0
    a2c8:       eb00173d        bl      ffc4 <file_findmetadata>
    a2cc:       e3500000        cmp     r0, #0
    a2d0:       0a000016        beq     a330 <elf32_arm_exec+0x7c>
    a2d4:       e1a04000        mov     r4, r0
    a2d8:       eb013172        bl      568a8 <efi_time_fini>
    a2dc:       e5940024        ldr     r0, [r4, #36]   ; 0x24
    a2e0:       ebffd8b0        bl      5a8 <efi_translate>
    a2e4:       e1a04000        mov     r4, r0
    a2e8:       e59f006c        ldr     r0, [pc, #108]  ; a35c <elf32_arm_exec+0xa8>
    a2ec:       e1a01004        mov     r1, r4
    a2f0:       e08f0000        add     r0, pc, r0
    a2f4:       eb01527f        bl      5ecf8 <printf>
    a2f8:       e5951008        ldr     r1, [r5, #8]
    a2fc:       e59f005c        ldr     r0, [pc, #92]   ; a360 <elf32_arm_exec+0xac>
    a300:       e08f0000        add     r0, pc, r0
    a304:       eb01527b        bl      5ecf8 <printf>
    a308:       e5950008        ldr     r0, [r5, #8]
    a30c:       e28d1004        add     r1, sp, #4
    a310:       e1a0200d        mov     r2, sp
    a314:       e3a03001        mov     r3, #1
    a318:       ebffd796        bl      178 <bi_load>
    a31c:       e3500000        cmp     r0, #0
    a320:       0a000006        beq     a340 <elf32_arm_exec+0x8c>
    a324:       e1a05000        mov     r5, r0
    a328:       eb013132        bl      567f8 <efi_time_init>
    a32c:       ea000000        b       a334 <elf32_arm_exec+0x80>
    a330:       e3a0504f        mov     r5, #79 ; 0x4f
    a334:       e1a00005        mov     r0, r5
    a338:       e24bd008        sub     sp, fp, #8
    a33c:       e8bd8830        pop     {r4, r5, fp, pc}
    a340:       eb0011a5        bl      e9dc <dev_cleanup>
    a344:       e59d0004        ldr     r0, [sp, #4]
    a348:       e12fff34        blx     r4
    a34c:       e59f0010        ldr     r0, [pc, #16]   ; a364 <elf32_arm_exec+0xb0>
    a350:       e08f0000        add     r0, pc, r0
    a354:       e1a0e00f        mov     lr, pc
    a358:       ea015251        b       5eca4 <panic>
    a35c:       0013f3bc        .word   0x0013f3bc
    a360:       0013b5f5        .word   0x0013b5f5
    a364:       00139826        .word   0x00139826

Maybe the above will prompt some ideas about the problem.
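A small arithmetic check on the listing above, offered as a sketch rather than a diagnosis: in ARM (A32) state a read of pc yields the address of the current instruction plus 8, so the `mov lr, pc` at 4b0 leaves 4b8 in lr. That is the first `.word` of the literal pool, so a return from getenv would resume in data rather than code, which would be consistent with the undefined instruction trap. The helper names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* In A32 state, reading pc gives the current instruction's address plus 8. */
#define SKETCH_A32_PC_READ_OFFSET	8u

/* Value that "mov lr, pc" located at insn_addr leaves in lr. */
uint32_t
sketch_lr_after_mov_lr_pc(uint32_t insn_addr)
{
	return (insn_addr + SKETCH_A32_PC_READ_OFFSET);
}

/* Offset of an absolute address within an image loaded at image_base. */
uint32_t
sketch_image_offset(uint32_t addr, uint32_t image_base)
{
	assert(addr >= image_base);
	return (addr - image_base);
}
```

`sketch_lr_after_mov_lr_pc(0x4b0)` gives 0x4b8, and `sketch_image_offset(0xb8dd34a4, 0xb8dd3000)` gives 0x4a4, matching the `pc=0x4a4` that U-Boot printed for the loaded image.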
Comment 11 Mark Millard 2022-05-21 02:14:29 UTC
(In reply to Mark Millard from comment #10)

Hmm:

void
geli_export_key_metadata(struct preloaded_file *kfp)
{
    struct keybuf *keybuf;

    keybuf = malloc(GELI_KEYBUF_SIZE);
    geli_export_key_buffer(keybuf);
    file_addmetadata(kfp, MODINFOMD_KEYBUF, GELI_KEYBUF_SIZE, keybuf);
    explicit_bzero(keybuf, GELI_KEYBUF_SIZE);
    free(keybuf);
}

No possibility of malloc failure and a bad-to-use
keybuf value? (But I'm not literate about the libsa
expectations for how things operate.)
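To make the concern concrete, here is one possible shape for a variant that checks the allocation before using it. It is only a sketch: the structure, the key size, the metadata helper, and the allocator parameter are hypothetical stand-ins so the example compiles outside the loader, and returning an error code is a design choice of this sketch, not something the existing function does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the loader's types and helpers. */
#define SKETCH_KEYBUF_SIZE	64

struct sketch_file {
	unsigned char	md[SKETCH_KEYBUF_SIZE];	/* last metadata payload */
	size_t		md_size;
	int		md_count;		/* number of records added */
};

static void
sketch_export_key_buffer(unsigned char *buf)
{
	memset(buf, 0xA5, SKETCH_KEYBUF_SIZE);	/* fake key material */
}

static void
sketch_addmetadata(struct sketch_file *fp, size_t size, const void *p)
{
	memcpy(fp->md, p, size);
	fp->md_size = size;
	fp->md_count++;
}

/* Wipe through a volatile pointer so the stores are not optimised away. */
static void
sketch_wipe(void *p, size_t n)
{
	volatile unsigned char *vp = p;

	while (n-- > 0)
		*vp++ = 0;
}

/* Allocator that always fails, for exercising the error path. */
void *
sketch_failing_alloc(size_t n)
{
	(void)n;
	return (NULL);
}

/* Returns 0 on success, -1 if the key buffer could not be allocated. */
int
sketch_export_key_metadata(struct sketch_file *kfp, void *(*alloc)(size_t))
{
	unsigned char *keybuf;

	assert(kfp != NULL && alloc != NULL);
	keybuf = alloc(SKETCH_KEYBUF_SIZE);
	if (keybuf == NULL)
		return (-1);	/* nothing to export, nothing to wipe */
	sketch_export_key_buffer(keybuf);
	sketch_addmetadata(kfp, SKETCH_KEYBUF_SIZE, keybuf);
	sketch_wipe(keybuf, SKETCH_KEYBUF_SIZE);
	free(keybuf);
	return (0);
}
```

Passing `malloc` takes the normal path, while passing `sketch_failing_alloc` shows that no metadata record is added when the allocation fails.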
Comment 12 commit-hook freebsd_committer freebsd_triage 2022-05-21 11:06:54 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=0d6600b579be769b85f049ef421023316f21b5c3

commit 0d6600b579be769b85f049ef421023316f21b5c3
Author:     Andrew Turner <andrew@FreeBSD.org>
AuthorDate: 2022-05-21 10:45:41 +0000
Commit:     Andrew Turner <andrew@FreeBSD.org>
CommitDate: 2022-05-21 10:45:41 +0000

    Set mm before passing it to the UEFI firmware

    When reading the UEFI memory map we pass in a pointer to the memory to
    hold the map. Unfortunately it wasn't initialised before the first use
    so clang decided it was undefined behaviour so the entire loop was
    removed. This leads to everything in bi_load after this to also be
    removed as dead code.

    The next function after bi_load in the binary is efi_copy_init. The
    above caused us to enter efi_copy_init with a return address of the
    start of the function. Because of this it would enter an infinite
    loop of calling the function, allocating memory, then returning to
    the start of the function.

    PR:             264021

 stand/efi/loader/bootinfo.c | 1 +
 1 file changed, 1 insertion(+)
Comment 13 Danilo Egea Gondolfo freebsd_committer freebsd_triage 2022-05-21 11:21:36 UTC
(In reply to commit-hook from comment #12)
It fixed the problem for me. Thanks, Andrew.
Comment 14 Mark Millard 2022-05-22 20:41:43 UTC
(In reply to commit-hook from comment #12)

Updating to a vintage that includes
the commit led to the loader.efi from the update
working fine for aarch64 and for armv7 in my context,
taking care of what I'd reported on the arm list and
here.

Thanks Andrew.


Side note:

geli_export_key_metadata still assumes that:

keybuf = malloc(GELI_KEYBUF_SIZE);

will return keybuf != NULL, so pointing to usable
memory. But I looked and it appeared that for
the libsa context malloc is structured to be
able to return NULL, even though in my contexts
it does not for this call.
Comment 15 Kubilay Kocak freebsd_committer freebsd_triage 2022-05-30 23:17:30 UTC
(In reply to Jessica Clarke from comment #8)

Could you include the reproducer from comment 8 as an attachment here please, and might it be usable as a base for a regression test?
Comment 16 commit-hook freebsd_committer freebsd_triage 2022-06-07 14:24:13 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=e7219c3818d1554830555278a32934c9fb4e7ac3

commit e7219c3818d1554830555278a32934c9fb4e7ac3
Author:     Andrew Turner <andrew@FreeBSD.org>
AuthorDate: 2022-05-21 10:45:41 +0000
Commit:     Andrew Turner <andrew@FreeBSD.org>
CommitDate: 2022-06-07 14:20:18 +0000

    Set mm before passing it to the UEFI firmware

    When reading the UEFI memory map we pass in a pointer to the memory to
    hold the map. Unfortunately it wasn't initialised before the first use
    so clang decided it was undefined behaviour so the entire loop was
    removed. This leads to everything in bi_load after this to also be
    removed as dead code.

    The next function after bi_load in the binary is efi_copy_init. The
    above caused us to enter efi_copy_init with a return address of the
    start of the function. Because of this it would enter an infinite
    loop of calling the function, allocating memory, then returning to
    the start of the function.

    PR:             264021
    (cherry picked from commit 0d6600b579be769b85f049ef421023316f21b5c3)

 stand/efi/loader/bootinfo.c | 1 +
 1 file changed, 1 insertion(+)