Bug 280038 - -Stable 14.1 on ARM compiler failure not seen in 14.1-RELEASE Pi3
Summary: -Stable 14.1 on ARM compiler failure not seen in 14.1-RELEASE Pi3
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: arm
Version: 14.1-STABLE
Hardware: arm64 Any
Importance: --- Affects Some People
Assignee: freebsd-arm (Nobody)
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2024-06-28 16:59 UTC by karl
Modified: 2024-07-08 11:36 UTC
CC List: 3 users

See Also:


Attachments
Reproducer binary (283.96 KB, application/x-compressed)
2024-07-02 10:55 UTC, karl
Reproducer script (901 bytes, application/x-compressed)
2024-07-02 10:56 UTC, karl

Description karl 2024-06-28 16:59:01 UTC
This is a rather odd problem and I'm uncertain of the scope.

Context: Pi3, checked multiple physical devices including one of the "newest" ones with connections for PoE HATs as well as an older unit, with identical results.  On a Pi4 4GB, booting the SAME SD card, there is no problem.

The package that fails is code I've had running for quite some time, and on 13.x-STABLE it works perfectly well.  I build using Crochet; -RELEASE, however, was checked both with my own worktree for releng/14.1 and stable/14.

The exact place in the source (which function it is compiling at the time) where the compiler blows up varies to some degree, but the crash is the same in all instances.  Whether I have the source on a UFS+Su/J filesystem on the SD card or copy it to a tmpfs (RAM disk) doesn't matter, so I surmise this is something that has changed either in the kernel or clang -- and it may be thread related.

That it never occurs on the Pi4 is troublesome, as that implies it's local to either the CPU on the "3" or its RAM architecture vs. the 4, given that I am literally plugging the same SD card into each.  There is no evidence of RAM exhaustion or similar.

Here's an example of the crash; I am running this on the physical (serial) console so if there was a kernel complaint about memory or similar it would be embedded in this output.

First, the boot message from "dmesg" on the Pi3 (some elided but including the ARM CPU info):
---<<BOOT>>---
WARNING: Cannot find freebsd,dts-version property, cannot check DTB compliance
Copyright (c) 1992-2023 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 14.1-STABLE stable/14-n268036-9a53391b601d GENERIC arm64
FreeBSD clang version 18.1.6 (https://github.com/llvm/llvm-project.git llvmorg-18.1.6-0-g1118c2e05e67)
VT(efifb): resolution 656x416
module scmi already present!
real memory  = 994041856 (947 MB)
avail memory = 945451008 (901 MB)
Starting CPU 1 (1)
Starting CPU 2 (2)
Starting CPU 3 (3)
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
arc4random: WARNING: initial seeding bypassed the cryptographic random device because it was not yet seeded and the knob 'bypass_before_seeding' was enabled.
random: entropy device external interface
kbd0 at kbdmux0
ofwbus0: <Open Firmware Device Tree>
simplebus0: <Flattened device tree simple bus> on ofwbus0
ofw_clkbus0: <OFW clocks bus> on ofwbus0
regfix0: <Fixed Regulator> on ofwbus0
clk_fixed2: clock-fixed has no clock-frequency
....
CPU  0: ARM Cortex-A53 r0p4 affinity:  0
                   Cache Type = <64 byte D-cacheline,64 byte I-cacheline,VIPT ICache,64 byte ERG,64 byte CWG>
 Instruction Set Attributes 0 = <CRC32>
 Instruction Set Attributes 1 = <>
 Instruction Set Attributes 2 = <>
         Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
         Processor Features 1 = <>
      Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,1TB PA>
Trying to mount root from ufs:/dev/mmcsd0s2a [ro]...
      Memory Model Features 1 = <8bit VMID>
      Memory Model Features 2 = <32bit CCIDX,48bit VA>
             Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
             Debug Features 1 = <>
         Auxiliary Features 0 = <>
         Auxiliary Features 1 = <>
AArch32 Instruction Set Attributes 5 = <CRC32,SEVL>
AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
CPU  1: ARM Cortex-A53 r0p4 affinity:  1
CPU  2: ARM Cortex-A53 r0p4 affinity:  2
CPU  3: ARM Cortex-A53 r0p4 affinity:  3
Release APs...done

And then....

root@rpi:/data/karl/HD-MCP # make clean
rm -f *.o hd-mcp hd-mcp.freeware license-server hd-commit
root@rpi:/data/karl/HD-MCP # make
cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c config.c -o config.o
cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c funcs.c -o funcs.o
cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c hd-mcp.c -o hd-mcp.o
PLEASE submit a bug report to https://bugs.freebsd.org/submit/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: cc -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c hd-mcp.c -o hd-mcp.o
1.      <eof> parser at end of file
2.      Code generation
3.      Running pass 'Function Pass Manager' on module 'hd-mcp.c'.
4.      Running pass 'AArch64O0PreLegalizerCombiner' on function '@process_unit_get_response'
#0 0x0000000004b17588 (/usr/bin/cc+0x4b17588)
#1 0x0000000004b15650 (/usr/bin/cc+0x4b15650)
#2 0x0000000004ae16a0 (/usr/bin/cc+0x4ae16a0)
#3 0x000000008a02eeb8 (/lib/libthr.so.3+0x2aeb8)
cc: error: clang frontend command failed with exit code 139 (use -v to see invocation)
FreeBSD clang version 18.1.6 (https://github.com/llvm/llvm-project.git llvmorg-18.1.6-0-g1118c2e05e67)
Target: aarch64-unknown-freebsd14.1
Thread model: posix
InstalledDir: /usr/bin
cc: note: diagnostic msg:
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
cc: note: diagnostic msg: /tmp/hd-mcp-720fdc.c
cc: note: diagnostic msg: /tmp/hd-mcp-720fdc.sh
cc: note: diagnostic msg:

********************
*** Error code 1

Stop.
make: stopped in /data/karl/HD-MCP
root@rpi:/data/karl/HD-MCP #

The crash is always in libthr.so.3 and at the same offset (+0x2aeb8); SOMETIMES if I re-execute the "make" command it will get past this file but then blows up in another one:

root@rpi:/data/karl/HD-MCP # make
cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c hd-mcp.c -o hd-mcp.o
cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c www.c -o www.o
cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c slave.c -o slave.o
cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c amcrest.c -o amcrest.o
cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c license.c -o license.o
cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c z-wave.c -o z-wave.o
cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c malloc.c -o malloc.o
cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c S0-encryption.c -o S0-encryption.o
PLEASE submit a bug report to https://bugs.freebsd.org/submit/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: cc -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c S0-encryption.c -o S0-encryption.o
1.      <eof> parser at end of file
2.      Code generation
3.      Running pass 'Function Pass Manager' on module 'S0-encryption.c'.
4.      Running pass 'AArch64O0PreLegalizerCombiner' on function '@generate_mac'
#0 0x0000000004b17588 (/usr/bin/cc+0x4b17588)
#1 0x0000000004b15650 (/usr/bin/cc+0x4b15650)
#2 0x0000000004ae16a0 (/usr/bin/cc+0x4ae16a0)
#3 0x000000008a39feb8 (/lib/libthr.so.3+0x2aeb8)
cc: error: clang frontend command failed with exit code 139 (use -v to see invocation)
FreeBSD clang version 18.1.6 (https://github.com/llvm/llvm-project.git llvmorg-18.1.6-0-g1118c2e05e67)
Target: aarch64-unknown-freebsd14.1
Thread model: posix
InstalledDir: /usr/bin
cc: note: diagnostic msg:
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
cc: note: diagnostic msg: /tmp/S0-encryption-2a4545.c
cc: note: diagnostic msg: /tmp/S0-encryption-2a4545.sh
cc: note: diagnostic msg:

********************
*** Error code 1

Stop.
make: stopped in /data/karl/HD-MCP

That file almost NEVER completes -- but once in a great while it will (!!) and when it does the executable that gets produced runs as expected.

The crash in the compiler, when it occurs, always has the same traceback to the same place in libthr.so.3, irrespective of which function is the one referenced in the compiler crash itself.
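
To make "the same place" concrete: the absolute address of frame #3 differs between the two runs above (0x8a02eeb8 and 0x8a39feb8), while the offset printed next to libthr.so.3 stays at 0x2aeb8. A minimal sketch of that arithmetic follows; `load_base` is an illustrative helper name, not part of any FreeBSD or LLVM tool, and reading the difference as the library's mapping address is an assumption about how the backtrace is formatted.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: given the absolute address of a backtrace frame
 * and the "+0x..." offset printed next to the library name, recover the
 * address at which the library appears to have been mapped in that run.
 */
uint64_t
load_base(uint64_t frame_addr, uint64_t offset)
{
	/* An offset larger than the address would mean a misread frame. */
	assert(frame_addr >= offset);
	return (frame_addr - offset);
}
```

Fed the two frames above, this gives 0x8a004000 and 0x8a375000: different page-aligned bases, same offset within the library.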

Clearing the object directory and re-running the build does not change the outcome.  But if I build releng/14.1 (same Crochet, just changing the source to releng/14.1 from stable/14)  and boot THAT, it never crashes.  It also never crashes during build on *either* version of the OS if I am running on a Pi4 -- only on the 3.

S0-encryption.c is an unremarkable file that contains a handful of functions, all related to using OpenSSL routines to perform encryption and decryption, along with computing (and checking) a MAC against data packets; it is only some 400 lines of C code.

If the crash disappears on future updates to stable/14 I'll withdraw it as OBE, but since this implies there's a potential problem with thread handling on the Pi3 under stable/14 I wanted to stick it out there.
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2024-06-28 17:54:40 UTC
"exit code 139" means that cc is dying with SIGSEGV, so at a glance this seems like a compiler bug of some kind.  As you note, it only appears on RPi3 though, so there is something platform dependent about the problem.

Does the generated reproducer actually work?  It might be worth posting that here to start.
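
For reference, the exit-code reading above can be sketched as follows, assuming the usual convention that a child killed by a signal is reported as 128 plus the signal number. `signal_from_exit_code` and `died_with_segv` are made-up helper names for illustration, not anything provided by clang or FreeBSD.

```c
#include <assert.h>
#include <signal.h>

/*
 * Hypothetical helper: map an exit code reported by a shell or compiler
 * driver to the signal that killed the child, assuming the conventional
 * 128 + signal-number encoding.  Returns 0 when the code does not fall
 * in the signal range.
 */
int
signal_from_exit_code(int code)
{
	assert(code >= 0 && code <= 255);
	return (code > 128 ? code - 128 : 0);
}

/* True when the exit code corresponds to a segmentation fault. */
int
died_with_segv(int code)
{
	return (signal_from_exit_code(code) == SIGSEGV);
}
```

With the value from the report, 139 - 128 = 11, which is SIGSEGV.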
Comment 2 Dimitry Andric freebsd_committer freebsd_triage 2024-06-28 18:00:17 UTC
Can you check dmesg for any out-of-memory errors? Or test your RAM?

If the crashes are fairly random, as they seem here, it is almost always due to memory errors: either corruption or failure to allocate something.

The backtrace to libthr.so.3 is most likely a red herring. The top few stack entries are interesting for diagnosing any crash, not the bottom ones (which are typically of the form "libthr is starting a thread here").
Comment 3 karl 2024-06-28 18:08:37 UTC
(In reply to Dimitry Andric from comment #2)

That error was captured on the console, so clearly there was no out-of-memory condition; a kernel complaint would have shown up there alongside the compiler output.  That was my first thought as well, and in fact that S0-encryption.c file was split off from hd-mcp.c specifically to attempt to isolate that possibility, because that particular source file is relatively large (although it has built on everything from a Pi2 forward on FreeBSD for close to 10 years across various revisions).

As for hardware-based corruption I'm skeptical and believe it is an OS (either compiler or kernel) problem specific to something different between releng/14.1 and current (yesterday) stable/14 because (1) releng/14.1 does not exhibit it on the SAME HARDWARE and in fact the precise same build using Crochet, altering ONLY the FreeBSD source worktree so all the RPI-firmware files and u-boot are identical between the builds, (2) hardware is almost certainly excluded because two entirely different Pi3s from different generations (an early one without the header for PoE and a new one that does have it) both behave identically and (3) a Pi4 which I own does not exhibit the problem on either stable/14 or releng/14.1.

Before posting this I went back through git log on the tree to see if I could find something that looked like it might be involved between the time when releng/14.1 was branched and today so I could perform a bisection to test that, but found nothing that stood out at me as likely to be related.
Comment 4 karl 2024-06-28 18:36:10 UTC
(In reply to Mark Johnston from comment #1)

Yep.

root@rpi:/data/karl/HD-MCP # make
cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c S0-encryption.c -o S0-encryption.o
PLEASE submit a bug report to https://bugs.freebsd.org/submit/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: cc -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c S0-encryption.c -o S0-encryption.o
1.      <eof> parser at end of file
2.      Code generation
3.      Running pass 'Function Pass Manager' on module 'S0-encryption.c'.
4.      Running pass 'AArch64O0PreLegalizerCombiner' on function '@generate_mac'
#0 0x0000000004b17588 (/usr/bin/cc+0x4b17588)
#1 0x0000000004b15650 (/usr/bin/cc+0x4b15650)
#2 0x0000000004ae16a0 (/usr/bin/cc+0x4ae16a0)
#3 0x000000008a3a9eb8 (/lib/libthr.so.3+0x2aeb8)
cc: error: clang frontend command failed with exit code 139 (use -v to see invocation)
FreeBSD clang version 18.1.6 (https://github.com/llvm/llvm-project.git llvmorg-18.1.6-0-g1118c2e05e67)
Target: aarch64-unknown-freebsd14.1
Thread model: posix
InstalledDir: /usr/bin
cc: note: diagnostic msg:
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
cc: note: diagnostic msg: /tmp/S0-encryption-9c36d0.c
cc: note: diagnostic msg: /tmp/S0-encryption-9c36d0.sh
cc: note: diagnostic msg:

********************
*** Error code 1

Stop.
make: stopped in /data/karl/HD-MCP
root@rpi:/data/karl/HD-MCP # mkdir REPRO
root@rpi:/data/karl/HD-MCP # cp /tmp/S0* REPRO
root@rpi:/data/karl/HD-MCP # cd REPRO
root@rpi:/data/karl/HD-MCP/REPRO # ls
S0-encryption-9c36d0.c  S0-encryption-9c36d0.sh
root@rpi:/data/karl/HD-MCP/REPRO # sh S0*.sh
PLEASE submit a bug report to https://bugs.freebsd.org/submit/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: /usr/bin/cc -cc1 -triple aarch64-unknown-freebsd14.1 -emit-obj -mrelax-all -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name S0-encryption.c -mrelocation-model static -mframe-pointer=non-leaf -ffp-contract=on -fno-rounding-math -mconstructor-aliases -funwind-tables=2 -target-cpu generic -target-feature +v8a -target-feature +fp-armv8 -target-feature +neon -target-abi aapcs -debug-info-kind=standalone -dwarf-version=4 -debugger-tuning=gdb -fdebug-compilation-dir=/data/karl/HD-MCP -fcoverage-compilation-dir=/data/karl/HD-MCP -D VERSION=\"8.0.0-LocalAuth\" -Wstrict-prototypes -ferror-limit 19 -fno-signed-char -fgnuc-version=4.2.1 -fskip-odr-check-in-gmf -fcolor-diagnostics -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -x c S0-encryption-9c36d0.c
1.      <eof> parser at end of file
2.      Code generation
3.      Running pass 'Function Pass Manager' on module 'S0-encryption-9c36d0.c'.
4.      Running pass 'AArch64O0PreLegalizerCombiner' on function '@generate_mac'
#0 0x0000000004b17588 (/usr/bin/cc+0x4b17588)
#1 0x0000000004b15650 (/usr/bin/cc+0x4b15650)
#2 0x0000000004b17cb0 (/usr/bin/cc+0x4b17cb0)
#3 0x000000008b94ceb8 (/lib/libthr.so.3+0x2aeb8)
Segmentation fault (core dumped)
root@rpi:/data/karl/HD-MCP/REPRO #


Those two reproducer files are now on the SD card (out of /tmp, which is a tmpfs), so now I halt that machine, place the same card in a Pi4, boot it and try running the reproducer there:

root@rpi:/data/karl/HD-MCP/REPRO # ls
S0-encryption-9c36d0.c  S0-encryption-9c36d0.sh cc.core
root@rpi:/data/karl/HD-MCP/REPRO # ls -al
total 25100
drwxr-xr-x  2 root wheel      512 Jun 28 18:30 .
drwxr-xr-x  5 root wheel     1024 Jun 28 18:30 ..
-rw-r--r--  1 root wheel  2578055 Jun 28 18:30 S0-encryption-9c36d0.c
-rw-r--r--  1 root wheel     2198 Jun 28 18:30 S0-encryption-9c36d0.sh
-rw-------  1 root wheel 33755136 Jun 28 18:30 cc.core
root@rpi:/data/karl/HD-MCP/REPRO # sh S0*.sh
root@rpi:/data/karl/HD-MCP/REPRO #

No crash.

root@rpi:/data/karl/HD-MCP/REPRO # uname -v
FreeBSD 14.1-STABLE stable/14-n268036-9a53391b601d GENERIC
root@rpi:/data/karl/HD-MCP/REPRO #
root@rpi:/data/karl/HD-MCP/REPRO # dmesg|grep ARM
psci0: <ARM Power State Co-ordination Interface Driver> on ofwbus0
gic0: <ARM Generic Interrupt Controller> mem 0x40041000-0x40041fff,0x40042000-0x40043fff,0x40044000-0x40045fff,0x40046000-0x40047fff irq 30 on simplebus0
generic_timer0: <ARMv8 Generic Timer> irq 4,5,6,7 on ofwbus0
Timecounter "ARM MPCore Timecounter" frequency 54000000 Hz quality 1000
Event timer "ARM MPCore Eventtimer" frequency 54000000 Hz quality 1000
bcm2835_cpufreq0: ARM 600MHz, Core 200MHz, SDRAM 400MHz, Turbo OFF
CPU  0: ARM Cortex-A72 r0p3 affinity:  0
CPU  1: ARM Cortex-A72 r0p3 affinity:  1
CPU  2: ARM Cortex-A72 r0p3 affinity:  2
CPU  3: ARM Cortex-A72 r0p3 affinity:  3
root@rpi:/data/karl/HD-MCP/REPRO #

Same SD card, but in the "4".

Again if I run this same compile on releng/14.1 it succeeds on BOTH.
Comment 5 Mark Millard 2024-06-29 02:05:05 UTC
(In reply to karl from comment #4)

I expect that markj might have been asking for you to provide copies of
files like the ones mentioned in:

Preprocessed source(s) and associated run script(s) are located at:
cc: note: diagnostic msg: /tmp/S0-encryption-9c36d0.c
cc: note: diagnostic msg: /tmp/S0-encryption-9c36d0.sh

if they reproduce the problem for you -- so others could attempt
their own replications of the problem, possibly even with
clang built with debug information so that the backtrace is more
useful. As it stands, you are the only one with a context in which
to explore beyond exactly what you have written.
Comment 6 karl 2024-06-30 16:12:23 UTC
Digging through the logs for clang-related changes, the only difference I see is this one commit, found in the lib/clang directory, that is in 14-STABLE and NOT in releng/14.1:

commit f1e3279983d6db1001af5fc9fb3a9821a1c353ef
Author: Dimitry Andric <dim@FreeBSD.org>
Date:   Fri May 24 17:51:19 2024 +0200

    Merge llvm-project release/18.x llvmorg-18.1.6-0-g1118c2e05e67

    This updates llvm, clang, compiler-rt, libc++, libunwind, lld, lldb and
    openmp to llvm-project release/18.x llvmorg-18.1.6-0-g1118c2e05e67.

    PR:             276104
    MFC after:      3 days

    (cherry picked from commit 3a0793336edfc21cb6d4c8c5c5d7f1665f3e6c5a)

The last commit in releng/14.1 is from 4 May; that commit bumps that from 18.1.5.0 to 18.1.6.

This implicates the following PR which was marked closed as fixing a problem with compilation of something else, particularly against PowerPC targets.....

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276104
Comment 7 karl 2024-06-30 17:40:11 UTC
I have not run the bisect all the way down yet, but bisecting between those two (good/bad) and building at this point works:

FreeBSD 14.1-STABLE n267672-ddabe1d3c515 GENERIC

So I suspect that commit in lib/clang is the bad one.  Where I am right now is here:

root@NewFS:/usr/src.14-STABLE/lib/clang # git bisect bad
Bisecting: 49 revisions left to test after this (roughly 6 steps)
[ddabe1d3c51556c84f830b0203204c55b495e57b] mlx5en: add diagnostic in one more case of failed eeprom read preparation

This bisection (after marking the previous one bad as it resulted in a missing include file) built and does NOT reproduce the problem.

Will continue, but I'm reasonably sure which commit it is that causes the problem at this point.
Comment 8 karl 2024-07-01 12:02:06 UTC
Update -- that specific commit IS NOT the problem.  Doing a full bisect now from where my tree is on stable/14 to there; the build with that commit in it that was suspect works (this is a bit slow because Crochet will not properly build partial changes, so I have to make clean and rebuild world and kernel on each.)
Comment 9 karl 2024-07-01 16:32:19 UTC
Update: We're down to somewhere in between these....

root@NewFS:/usr/src.14-STABLE # git bisect good
Bisecting: 27 revisions left to test after this (roughly 5 steps)
[dd8575e19ae02e7e8a10abd8592ac764263e9176] qlnx: Use device_set_descf()
root@NewFS:/usr/src.14-STABLE # git bisect good
Bisecting: 13 revisions left to test after this (roughly 4 steps)
[a3f7b81fdd2205207085ba5d038b402aba748c6e] mbuf: provide m_freemp()
root@NewFS:/usr/src.14-STABLE # git bisect bad
Bisecting: 6 revisions left to test after this (roughly 3 steps)
[6e345bea25d476baf6de7fb3b60127d39b464837] makefs/zfs: Add a helper function for adding ZAP entries
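
For anyone following along, the narrowing above is a binary search over the commit history, which is why the step estimates shrink by about one each time the range halves. A minimal sketch of that search follows. It is not git's implementation; `first_bad` and `example_is_bad` are hypothetical names, and the oracle stands in for "build world and kernel, boot the Pi3, run the compile, see whether cc crashes".

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical oracle: pretend revision 100 introduced the fault, so
 * every revision from 100 onward tests bad.
 */
int
example_is_bad(size_t rev)
{
	return (rev >= 100);
}

/*
 * Find the first bad revision, given that revision lo is known good and
 * revision hi is known bad.  If steps is not NULL it receives the number
 * of revisions that had to be tested.
 */
size_t
first_bad(size_t lo, size_t hi, int (*is_bad)(size_t), unsigned *steps)
{
	unsigned tested = 0;

	assert(lo < hi);
	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		tested++;
		if (is_bad(mid))
			hi = mid;	/* fault is at mid or earlier */
		else
			lo = mid;	/* fault is after mid */
	}
	if (steps != NULL)
		*steps = tested;
	return (hi);
}
```

Each test halves the remaining range, so the number of builds grows with the base-2 logarithm of the number of candidate revisions rather than with the number itself.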
Comment 10 karl 2024-07-01 20:37:53 UTC
DING DING DING

Winner winner chicken dinner -- this is the bad commit:

root@rpi:/data/HD-MCP # make clean
rm -f *.o hd-mcp hd-mcp.freeware license-server hd-commit
root@rpi:/data/HD-MCP # make
cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c config.c -o config.o
cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c funcs.c -o funcs.o
cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c hd-mcp.c -o hd-mcp.o
PLEASE submit a bug report to https://bugs.freebsd.org/submit/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: cc -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c hd-mcp.c -o hd-mcp.o
1.      <eof> parser at end of file
2.      Code generation
3.      Running pass 'Function Pass Manager' on module 'hd-mcp.c'.
4.      Running pass 'AArch64O0PreLegalizerCombiner' on function '@process_unit_get_response'
#0 0x0000000004b17588 (/usr/bin/cc+0x4b17588)
#1 0x0000000004b15650 (/usr/bin/cc+0x4b15650)
#2 0x0000000004ae16a0 (/usr/bin/cc+0x4ae16a0)
#3 0x0000000089daaeb8 (/lib/libthr.so.3+0x2aeb8)
cc: error: clang frontend command failed with exit code 139 (use -v to see invocation)
FreeBSD clang version 18.1.6 (https://github.com/llvm/llvm-project.git llvmorg-18.1.6-0-g1118c2e05e67)
Target: aarch64-unknown-freebsd14.1
Thread model: posix
InstalledDir: /usr/bin
cc: note: diagnostic msg:
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
cc: note: diagnostic msg: /tmp/hd-mcp-200d21.c
cc: note: diagnostic msg: /tmp/hd-mcp-200d21.sh
cc: note: diagnostic msg:

********************
*** Error code 1

Stop.
make: stopped in /data/HD-MCP



root@NewFS:/usr/src.14-STABLE # git bisect start
status: waiting for both good and bad commits
root@NewFS:/usr/src.14-STABLE # git bisect good f1e3279983d6db1001af5fc9fb3a9821a1c353ef
status: waiting for bad commit, 1 good commit known
root@NewFS:/usr/src.14-STABLE # git bisect bad 939f5a7b2bfbd7ba3b23ddc691e12e8a332623f4
Bisecting: 111 revisions left to test after this (roughly 7 steps)
[7ad7453748e2adafa1e1a3e44b02fc852d4c5301] LinuxKPI: 802.11: change teardown order to avoid iwlwifi firmware crashes
root@NewFS:/usr/src.14-STABLE # git branch
* (no branch, bisect started on stable/14)
+ main
+ releng/14.1
+ stable/12
+ stable/13
  stable/14
root@NewFS:/usr/src.14-STABLE # git bisect bad
Bisecting: 55 revisions left to test after this (roughly 6 steps)
[ac658a7c760d9db9fcd11cdeb3b858411dedf754] rc: Set var_run_enable to enable by default
root@NewFS:/usr/src.14-STABLE # git bisect good
Bisecting: 27 revisions left to test after this (roughly 5 steps)
[dd8575e19ae02e7e8a10abd8592ac764263e9176] qlnx: Use device_set_descf()
root@NewFS:/usr/src.14-STABLE # git bisect good
Bisecting: 13 revisions left to test after this (roughly 4 steps)
[a3f7b81fdd2205207085ba5d038b402aba748c6e] mbuf: provide m_freemp()
root@NewFS:/usr/src.14-STABLE # git bisect bad
Bisecting: 6 revisions left to test after this (roughly 3 steps)
[6e345bea25d476baf6de7fb3b60127d39b464837] makefs/zfs: Add a helper function for adding ZAP entries
root@NewFS:/usr/src.14-STABLE # git bisect good
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[a40287d6312e598fc65c5a7bbdefe6f9e15b7a5f] simd(7): add missing aarch64 SIMD functions
root@NewFS:/usr/src.14-STABLE # git bisect good
Bisecting: 1 revision left to test after this (roughly 1 step)
[3562d64e794b2614f15728c8b7e3a6dff0b644a7] sqlite3: Vendor import of sqlite3 3.46.0
root@NewFS:/usr/src.14-STABLE # git bisect good
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[55c5dad2f305f74d1ff5ca85c453635511aab9b2] Merge commit 382f70a877f0 from llvm-project (by Louis Dionne):
root@NewFS:/usr/src.14-STABLE # git bisect bad
55c5dad2f305f74d1ff5ca85c453635511aab9b2 is the first bad commit
commit 55c5dad2f305f74d1ff5ca85c453635511aab9b2 (HEAD)
Author: Dimitry Andric <dim@FreeBSD.org>
Date:   Fri Jun 7 20:42:53 2024 +0200

    Merge commit 382f70a877f0 from llvm-project (by Louis Dionne):

      [libc++][NFC] Rewrite function call on two lines for clarity (#79141)

      Previously, there was a ternary conditional with a less-than comparison
      appearing inside a template argument, which was really confusing because
      of the <...> of the function template. This patch rewrites the same
      statement on two lines for clarity.

    Merge commit d129ea8d2fa3 from llvm-project (by Vitaly Buka):

      [libcxx] Align `__recommend() + 1`  by __endian_factor (#90292)

      This is detected by asan after #83774

      Allocation size will be divided by `__endian_factor` before storing. If
      it's not aligned,
      we will not be able to recover allocation size to pass into
      `__alloc_traits::deallocate`.

      we have code like this
      ```
       auto __allocation = std::__allocate_at_least(__alloc(), __recommend(__sz) + 1);
          __p               = __allocation.ptr;
          __set_long_cap(__allocation.count);

      void __set_long_cap(size_type __s) _NOEXCEPT {
          __r_.first().__l.__cap_     = __s / __endian_factor;
          __r_.first().__l.__is_long_ = true;
        }

      size_type __get_long_cap() const _NOEXCEPT {
          return __r_.first().__l.__cap_ * __endian_factor;
        }

      inline ~basic_string() {
          __annotate_delete();
          if (__is_long())
            __alloc_traits::deallocate(__alloc(), __get_long_pointer(), __get_long_cap());
        }
      ```
      1. __recommend() -> even size
      2. `std::__allocate_at_least(__alloc(), __recommend(__sz) + 1)` - > not
      even size
      3. ` __set_long_cap() `- > lose one bit of size for __endian_factor == 2
      (see `/ __endian_factor`)
      4. `__alloc_traits::deallocate(__alloc(), __get_long_pointer(),
      __get_long_cap())` -> uses even size (see `__get_long_cap`)

    This should fix incorrect deallocation sizes for some instances of
    std::string. Memory profiling or debugging tools like AddressSanitizer,
    LeakSanitizer or TCMalloc could then complain about the the size passed
    to a deallocation not matching the size originally passed to the
    allocation.

    Reported by:    Aliaksei Kandratsenka <alkondratenko@gmail.com>
    PR:             279560
    MFC after:      3 days

    (cherry picked from commit ead8e4c081e5c4de4d508fc353f381457b058ca6)

 contrib/llvm-project/libcxx/include/string | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
root@NewFS:/usr/src.14-STABLE #
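
A minimal sketch of the rounding loss that the commit message describes, written as plain C rather than the libc++ template code. `ENDIAN_FACTOR`, `stored_cap`, `recovered_cap` and `align_to_factor` are illustrative names, and the value 2 is assumed from the "__endian_factor == 2" case in the message. The round-up helper is only one way to get an aligned size and is not claimed to be what the three-line libc++ change does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for libc++'s __endian_factor on layouts where it is 2. */
enum { ENDIAN_FACTOR = 2 };

/* Mirrors __set_long_cap(): the stored field drops the low bit. */
size_t
stored_cap(size_t allocated)
{
	return (allocated / ENDIAN_FACTOR);
}

/* Mirrors __get_long_cap(): the size later passed to deallocate(). */
size_t
recovered_cap(size_t stored)
{
	return (stored * ENDIAN_FACTOR);
}

/* Round n up to a multiple of ENDIAN_FACTOR (illustrative only). */
size_t
align_to_factor(size_t n)
{
	/* Guard against wrapping when n is near SIZE_MAX. */
	assert(n <= SIZE_MAX - (ENDIAN_FACTOR - 1));
	return ((n + ENDIAN_FACTOR - 1) / ENDIAN_FACTOR * ENDIAN_FACTOR);
}
```

With an odd allocation of 23, the stored field is 11 and the recovered size is 22, one short of what was allocated; after rounding 23 up to 24 the round trip is exact.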
Comment 11 Mark Millard 2024-07-01 21:17:59 UTC
(In reply to karl from comment #10)

How does anyone else but you test for the status of the problem?
Does anyone have the source code that you are compiling so that
they have a chance of testing? Some other source code known to
reproduce the problem in your context? An example might be a
smaller subset of the original source code.
Comment 12 karl 2024-07-01 21:37:21 UTC
(In reply to Mark Millard from comment #11)
It won't allow me to attach it -- even the "limited" case, the S0-encryption.c reproducer that the compiler dumps, is ~2.5MB as a result of the preprocessor includes.  The file itself is a whopping 458 lines.

I am now building with that single commit reverted on my local stable/14 tree as another test.
Comment 13 karl 2024-07-01 21:56:41 UTC
With that commit reverted (it's a 3-line change) the build reliably completes.

I will take the reversion back out and see if I can remove enough includes which aren't necessary for that specific routine to get it under the attach limit and see if it still crashes.  Unfortunately the crash message does not tell me which line of the ~2.5MB dump of the code with the preprocessor includes caused it to blow up.
Comment 14 karl 2024-07-02 02:37:44 UTC
Breaking up the S0-encryption file into constituent functions (specifically breaking two of the functions in that short file into separate files) makes the fault intermittent.

That is, it will frequently SEGV, but then if I type "make" again it might complete -- or it might blow up again, even down to the single routine level.

Here's one of the routines that *sometimes* blows with the same fault:

root@rpi:/data/HD-MCP # make
cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c S0-mac.c -o S0-mac.o
PLEASE submit a bug report to https://bugs.freebsd.org/submit/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: cc -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c S0-mac.c -o S0-mac.o
1.      <eof> parser at end of file
2.      Code generation
3.      Running pass 'Function Pass Manager' on module 'S0-mac.c'.
4.      Running pass 'AArch64O0PreLegalizerCombiner' on function '@generate_mac'
#0 0x0000000004b17588 (/usr/bin/cc+0x4b17588)
#1 0x0000000004b15650 (/usr/bin/cc+0x4b15650)
#2 0x0000000004ae16a0 (/usr/bin/cc+0x4ae16a0)
#3 0x000000008adfceb8 (/lib/libthr.so.3+0x2aeb8)
cc: error: clang frontend command failed with exit code 139 (use -v to see invocation)
FreeBSD clang version 18.1.6 (https://github.com/llvm/llvm-project.git llvmorg-18.1.6-0-g1118c2e05e67)
Target: aarch64-unknown-freebsd14.1
Thread model: posix
InstalledDir: /usr/bin
cc: note: diagnostic msg: Error generating preprocessed source(s).
*** Error code 1

Stop.
make: stopped in /data/HD-MCP


This is the source of that file -- I pulled this single routine out.  Yet it does not *always* blow up when I separate things out (I also pulled the "decrypt_packet" routine out into a different file) -- after that blowup, if I execute make again...

root@rpi:/data/HD-MCP # make
cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c S0-mac.c -o S0-mac.o
cc -g -o hd-mcp hd-mcp.o www.o config.o slave.o amcrest.o license.o funcs.o z-wave.o malloc.o S0-encryption.o S0-decrypt.o S0-mac.o root-include.o boot-include.o -lm -lcrypt -lssl -lpthread -lcrypto -lgpio

The build completes.

But again, with the above commit reverted the compiler *never* blows up on the Pi3, nor does it blow up with the commit in on a Pi4 or on an amd64 box, all of which are on the same revision of stable/14.


/*
 * HomeDaemon-MCP - S0-generate_mac.c
 * Copyright 2016/2017/2024 Karl Denninger (karl@denninger.net);
 * all rights reserved.
 * Unauthorized reproduction or distribution of this file, a component
 * of HomeDaemon MCP, is expressly prohibited.
 */

#include        <stdio.h>
#include        <ctype.h>
#include        <sys/types.h>
#include        <errno.h>
#include        <stdlib.h>
#include        <syslog.h>
#include        <stdarg.h>
#include        <string.h>

#include        "defs.h"

#ifndef OPENSSL
#undef  GPIO
#undef  ANALOG
#undef  IIC
#endif  // OPENSSL

#include        "forwards.h"
#include        "externs.h"

#ifdef  OPENSSL
#include        <openssl/conf.h>
#include        <openssl/x509v3.h>
#include        <openssl/ssl.h>
#include        <openssl/evp.h>
#include        <openssl/rsa.h>
#include        <openssl/engine.h>


/*
 * generate_mac -- generate the MAC for the packet we are about to send.
 * This only actually returns 8 bytes of MAC, but is a bit more complex
 * than that to compute.
 */
void    generate_mac(int unit, int mode, int from, int to, unsigned char *msgbuf, unsigned char *iv, int len, unsigned char *out)
//int   unit;   /* The unit to get the temp key from */
//int   mode;   /* The command subclass we're encapsulating */
//int   from;   /* The unit we are sending from */
//int   to;             /* The unit we are sending to */
//unsigned char *msgbuf;
//unsigned char *iv;
//int   len;            /* Length of msgbuf */
//unsigned char *out;   /* Output in binary form, 8 byte buffer assumed */

{
        int     enclen;
        unsigned char   cipher[16];             /* Our encrypted IV */
        unsigned char   cipher2[16];    /* Temporary holding area */
        unsigned char   tmphold[16];    /* Holding area for computation */
        unsigned char   auth[16];               /* Where the MAC goes temporarily */

        unsigned char   buffer[256];
        int     bufsize;
        int     loc;

        int     x, y;

        EVP_CIPHER_CTX  *EVP_auth_ctx;

        bzero(buffer, 256);
        bzero(tmphold, 16);
        bzero(cipher, 16);
        bzero(cipher2, 16);
/*
 * Build a buffer to be turned into the MAC
 */
        buffer[0] = (unsigned char) mode;
        buffer[1] = (unsigned char) from;
        buffer[2] = (unsigned char) to;
        buffer[3] = (unsigned char) len;        /* Length of the encrypted data */
        memcpy(&buffer[4], msgbuf, len);        /* Copy it in */
        bufsize = len + 4;      /* Initialize the buffer size */

/*
 * Now encrypt our IV with the auth key using ecb.
 */

        EVP_auth_ctx = EVP_CIPHER_CTX_new();
        EVP_EncryptInit_ex(EVP_auth_ctx, EVP_aes_128_ecb(), NULL, units[unit].AuthKey, NULL);
        EVP_EncryptUpdate(EVP_auth_ctx, cipher, &enclen, iv, 16);

        if (enclen != 16) {     /* Something's wrong */
                panic("Wrong encryption return length!");
        }

        bzero(tmphold, 16);     /* Zero the temporary holding area */
        loc = 0;        /* And reset our location */
        for (x = 0; x < bufsize; x++) { /* Iterate over the buffer */
                tmphold[loc++] = buffer[x];     /* Hold it */
                if (loc == 16) {        /* Encrypt when filled */
                        for (y = 0; y < 16; y++) {
                                cipher[y] = tmphold[y] ^ cipher[y];
                                tmphold[y] = 0; /* And clear it... */
                        }
                        loc = 0;        /* Clear counter */
                        EVP_EncryptUpdate(EVP_auth_ctx, cipher2, &enclen, cipher, 16);
                        if (enclen != 16) {     /* Something's wrong */
                                panic("Wrong encryption return length!");
                        }
                        memcpy(cipher, cipher2, 16);    /* Copy back */
                }
        }
        if (loc > 0) {  /* If there's a partial block encrypt it too */
                for (y = 0; y < 16; y++) {      /* Do the last piece */
                        cipher[y] = tmphold[y] ^ cipher[y];
                }
                EVP_EncryptUpdate(EVP_auth_ctx, cipher2, &enclen, cipher, 16);
                if (enclen != 16) {     /* Something's wrong */
                        panic("Wrong encryption return length!");
                }
                memcpy(cipher, cipher2, 16);    /* Copy back */
        }
        EVP_CIPHER_CTX_free(EVP_auth_ctx);
        memcpy(out, cipher, 8);
        return;
}

#endif  // OPENSSL
Comment 15 Mark Millard 2024-07-02 03:54:52 UTC
(In reply to karl from comment #12)

Compress the file and submit that. Text files generally compress well.
Comment 16 Mark Millard 2024-07-02 03:58:30 UTC
(In reply to karl from comment #14)

This should allow someone that has built the system clang
without stripping symbols to get a backtrace if they
manage to reproduce any examples of the failures.
Comment 17 Mark Millard 2024-07-02 04:05:53 UTC
(In reply to karl from comment #14)

defs.h is missing from what is presented.

I was only checking on completeness of the supplied source code:

cc  -g -Wstrict-prototypes -DVERSION=\"8.0.0-LocalAuth\" -c S0-mac.c -o S0-mac.o
S0-mac.c:20:17: fatal error: 'defs.h' file not found
   20 | #include        "defs.h"
      |                 ^~~~~~~~
1 error generated.
Comment 18 Dimitry Andric freebsd_committer freebsd_triage 2024-07-02 08:36:48 UTC
Can you please upload the original reproducer files somewhere in full? 

One thing that I can think of is that your clang binary is somehow bad, or not recompiled against the fixed libc++. I didn't bump the clang internal version for this, but maybe that was wrong.

In any case, I cannot help if I do not have a reproduction scenario.
Comment 19 karl 2024-07-02 10:55:41 UTC
Created attachment 251827 [details]
Reproducer binary
Comment 20 karl 2024-07-02 10:56:02 UTC
Created attachment 251828 [details]
Reproducer script
Comment 21 karl 2024-07-02 11:01:13 UTC
(In reply to Dimitry Andric from comment #18)
Reproducer binary and script, compressed, uploaded.

That's a minimalist case from the code and does not blow up every time on the Pi3-- and invoking it on my AMD64 primary build box, or the Pi4 booted from the same media as the 3, never blows up.
Comment 22 Mark Millard 2024-07-02 14:58:41 UTC
The .sh has:

"-fdebug-compilation-dir=/data/HD-MCP"
"-fcoverage-compilation-dir=/data/HD-MCP"

that presume some local context and might need to be adjusted
by folks looking to use the script.
Comment 23 karl 2024-07-02 15:35:00 UTC
(In reply to Mark Millard from comment #22)
That is the directory in which the "make" command was given.
Comment 24 karl 2024-07-02 15:49:53 UTC
(In reply to Mark Millard from comment #22)
IMHO the correct line of inquiry here, given that I've identified the commit that causes this to happen, is for someone with sufficient knowledge of both the llvm internals *and* the ARM Cortex processor differences between the Pi3 and 4 to take a critical look at that delta, particularly the "endian" element of it.

Padding an alignment should never burn you (it might waste RAM on a temporary basis but if I need 3 bytes and allocate 4 that doesn't do harm other than the extra byte used) but if you wind up overflowing a pointer or other allocation somewhere on the next reference you're essentially certain to take a SEGV.
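
The padding point above can be illustrated with a minimal sketch. It assumes a power-of-two alignment; `align_up` and `padding_waste` are hypothetical helpers written for illustration and are not taken from the llvm/libc++ change under discussion.

```c
#include <stddef.h>

/*
 * Round `size` up to the next multiple of `align`.
 * Assumption: `align` is a power of two. Hypothetical helper for
 * illustration only; not code from the commit being discussed.
 */
size_t align_up(size_t size, size_t align)
{
        return (size + align - 1) & ~(align - 1);
}

/*
 * Bytes lost to padding. This is harmless by itself; trouble only
 * starts if something later reads or writes past the padded size.
 */
size_t padding_waste(size_t size, size_t align)
{
        return align_up(size, align) - size;
}
```

With these definitions, asking for 3 bytes at an alignment of 4 reserves 4 bytes and wastes 1, which matches the "need 3, allocate 4" case described above.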
Comment 25 Dimitry Andric freebsd_committer freebsd_triage 2024-07-02 18:27:24 UTC
I couldn't reproduce any crash on the only arm-like machine I have, which is an arm64 VM on my Mac. Even running the test case 1000 times shows no hiccups.

I tried this on both 14.1-RELEASE and 14.1-RELEASE-p2 (which includes the libc++ fix).

Does this crash really only occur on a 32-bit arm host?
Comment 26 karl 2024-07-02 18:49:14 UTC
(In reply to Dimitry Andric from comment #25)
Yes.

On stable/14 with the commit I bisected to in, if I boot the same SD card on a Pi4, the reproducer does *not* crash irrespective of how many times.  It also does not crash on an AMD64 box (my build machine) again, irrespective of how many attempts are made.

On a Pi3, with the commit in on stable/14, it *does* crash quite reliably: not every single time, but most of the time.  If I do not pull routines out of that original file, then perhaps 1 compile out of 100 of that file will complete.  Even just that 100-line bit of source blows up most of the time, and the reproducer it dumps does so as well.  On releng/14.1, which does not have the commit, the reproducer does not crash.

If I execute a git revert on *just that commit* on stable/14, leaving everything else alone, it does not crash no matter how many times I attempt to compile just the reproducer I posted, again, on the same Pi3 hardware.
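
One way to put a number on "most of the time" is to run the reproducer in a loop and count the failed runs. The following is a minimal sketch, not a tool used in this report: `count_failures` is a hypothetical helper, and treating both death by signal and a nonzero exit status as a failure is an assumption, since a shell wrapper normally reports a crashed child through its own exit status.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run the command in `argv` (argv[0] must be a full path) `runs` times
 * and return how many runs failed. A run counts as failed if the child
 * was killed by a signal or exited with a nonzero status.
 * Returns -1 if fork() or waitpid() itself fails.
 * Hypothetical helper for illustration only.
 */
int count_failures(char *const argv[], int runs)
{
        int failures = 0;

        for (int i = 0; i < runs; i++) {
                pid_t pid = fork();
                if (pid < 0)
                        return -1;
                if (pid == 0) {
                        execv(argv[0], argv);
                        _exit(127);     /* exec failed; counted as a failure */
                }

                int status;
                if (waitpid(pid, &status, 0) < 0)
                        return -1;
                if (WIFSIGNALED(status) ||
                    (WIFEXITED(status) && WEXITSTATUS(status) != 0))
                        failures++;
        }
        return failures;
}
```

Calling it with an argv such as { "/bin/sh", "./reproducer.sh", NULL } (the script path is a placeholder) and runs set to 100 would give a failure count that can be compared directly between the Pi3 and the Pi4, or between builds with and without the commit.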
Comment 27 karl 2024-07-02 18:59:18 UTC
Note that the build (and running kernel) is aarch64, not aarch32.

From the boot dmesg:


FreeBSD 14.1-STABLE stable/14-n268044-939f5a7b2bfb GENERIC arm64
.....
CPU  0: ARM Cortex-A53 r0p4 affinity:  0
                   Cache Type = <64 byte D-cacheline,64 byte I-cacheline,VIPT ICache,64 byte ERG,64 byte CWG>
 Instruction Set Attributes 0 = <CRC32>
 Instruction Set Attributes 1 = <>
 Instruction Set Attributes 2 = <>
         Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
         Processor Features 1 = <>
Trying to mount root from ufs:/dev/mmcsd0s2a [ro]...
      Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,1TB PA>
      Memory Model Features 1 = <8bit VMID>
      Memory Model Features 2 = <32bit CCIDX,48bit VA>
             Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
             Debug Features 1 = <>
         Auxiliary Features 0 = <>
         Auxiliary Features 1 = <>
AArch32 Instruction Set Attributes 5 = <CRC32,SEVL>
AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
CPU  1: ARM Cortex-A53 r0p4 affinity:  1
CPU  2: ARM Cortex-A53 r0p4 affinity:  2
CPU  3: ARM Cortex-A53 r0p4 affinity:  3
Release APs...done

root@rpi:~ # sysctl -a|grep hw.machine_arch
hw.machine_arch: aarch64
Comment 28 Mark Millard 2024-07-03 00:37:13 UTC
Not reproducible so far on the RPi3B here using an official FreeBSD
build of stable/14 (installed via pkgbase):

. . .
U-Boot 2023.07.02 (Nov 03 2023 - 06:00:41 +0000)



DRAM:  948 MiB

RPI 3 Model B (0xa02082)

. . .
FreeBSD 14.1-STABLE stable/14-n268059-536a452cc4e3 GENERIC arm64
FreeBSD clang version 18.1.6 (https://github.com/llvm/llvm-project.git llvmorg-18.1.6-0-g1118c2e05e67)
. . .
bcm2835_cpufreq0: ARM 600MHz, Core 250MHz, SDRAM 400MHz, Turbo OFF
CPU  0: ARM Cortex-A53 r0p4 affinity:  0
. . .

This is without swap being enabled.

For reference:

yours: FreeBSD 14.1-STABLE stable/14-n268044-939f5a7b2bfb GENERIC arm64
mine:  FreeBSD 14.1-STABLE stable/14-n268059-536a452cc4e3 GENERIC arm64

As I understand, you are building yours via the non-standard technique
called Crochet. By contrast I'm using official builds: I'm not doing
the builds at all. There might be differences based on that that you
would need to track down. (You still have the only reproduction
context that is known.)

Technically the media that I used had an update sequence/history that
was ( -p2 and before not recent ):

FreeBSD-14.0-RELEASE-arm64-aarch64-RPI.img.xz uncompressed and dd'd to the media
freebsd-update'd to 14.0-RELEASE-p2
freebsd-update'd to 14.0-RELEASE-p8 ( the kernel of that is from -p6 )
pkgbase update'd to stable/14-n268059-536a452cc4e3

I used the media in multiple aarch64 systems over this time, not
limited to RPi*'s by any means. Very little configuration was involved.
Serial console and ssh over ethernet access. Some stages prior to
FreeBSD's loader.efi tend to be more system specific but are generally
handled via being on separate media.

It seems that it would be good if you tested an official build in your
context at some point, not just ones that you build. (Unfortunately,
there are no stable/14 snapshots available until Friday/Saturday.)

If you get a .c/.sh pair for the full original program that you
indicate almost always fails for you and, then, if you make the
compressed variants available, I might at some point be able to try
with that in my context.

I expect that some of the hypotheses that have been made so far are
not a good fit with my RPi3B results and, so, that more exploration
and identification of context differences is needed.
Comment 29 karl 2024-07-03 01:15:03 UTC
You have a later revision from perusing the log.  I don't see anything in the commit log between those two points that would bear on this, but as you noted there hasn't been an official snapshot available for a while (I looked before reporting this, so source build was my only real option.)

I have no swap; Crochet is for all intents and purposes a NanoBSD build, but that "official" means of building from source hasn't worked on the Pi series for quite a while, and inquiry said it hasn't been worked on.  It IS cross-compiled on an AMD64 system (I don't know if the official builds are done native or via cross-compilation, and if they're native it might well be that the compiler change exposed a problem with that.)

I'll pull an update; the top of stable/14 is (as of right now) 1a0314d6e30554fc2b07caa5121b00956f416cc4, which has the sshd fix in it that was recently put up as an advisory, so I want that anyway, and see if I can still reproduce -- if so, then I'll go grab an "official" if they're back online and see if it happens there as well.
Comment 30 Mark Millard 2024-07-03 01:18:18 UTC
As a way of identifying the specific pkgbase snapshot vintage
involved:

# ls -lodTt /var/cache/pkg/*.snap*.pkg | grep -v "^l" | sed -E 's@^[^/]*(/.*/pkg/([^-]*-)(.*)(\.snap[^~]*)~[^.]*\.pkg)$@\2\4@' | sort -r | uniq
FreeBSD-.snap20240702082739

If .pkg files for older pkgbase snapshots were still present in
/var/cache/pkg/ as well, it would show those too, most recent first
to oldest last. pkgbase can do incremental updates, so the .pkg
files can be from a mix of snapshots. But /var/cache/pkg/ can have
.pkg files that were no longer used in the most recent update:
older history.
Comment 31 karl 2024-07-03 01:21:26 UTC
(In reply to Mark Millard from comment #30)
I just did a git "pull --ff-only" which grabbed nothing (I did so on this tree as it's on several of my AMD64 systems as well, and the sshd CVE came out recently) and am running a build now.

If this reproduces then I'll grab the series of things you've got and see what I can find in terms of differences.  Among the llvm/clang components there should, in theory at least, not be any (assuming cross-compile and native produce the same deterministic output of course.)
Comment 32 Mark Millard 2024-07-03 01:44:20 UTC
(In reply to karl from comment #29)

Unlike for ports, historically official system builds are done on amd64.
I expect such is true for pkgbase builds as well but, as far as I know,
such details are not publicly available.

Artifact builds, such as in:

https://artifact.ci.freebsd.org/snapshot/14.0-STABLE/536a452cc4e388454d829144dab95927ec39128f/arm64/aarch64/

also have *-dbg.txz files that provide debugging/symbol
information. pkgbase builds have *-dbg-*.pkg that provide
debug/symbol information. (Artifact builds have filled in
/etc/ . . . and the like and some *.txz files should not be
expanded over something that should be preserved. I tend to
use such via from-scratch media that has nothing to be
preserved.)

An advantage here of *-dbg.txz and *-dbg*.pkg materials is
that any backtrace should be more useful, referencing source
files and possibly line numbers.

snapshot builds and release builds do not provide such. How you
are building does not. My personal builds of main are set up
to have the symbols and debug information --but for an optimized
build. (That tends to make the results more of a mixed bag but
is better than the results for stripped/no-debug-info builds.)
Comment 33 karl 2024-07-03 02:00:16 UTC
(In reply to Mark Millard from comment #32)

Your non-reproduction, with a clang version new enough that the change is in it, raises one question that I don't know how to answer cleanly without setting up a completely separate build environment: what builds the -STABLE snapshots and thus the pkgbase bits?

The reason is this -- my build machine is on 14.1 and the revision it has both for the kernel and compiler is *after* the patch, so it's cross-compiling *with the patch in* as the source compiler.  If the build environment that is putting the update files together is running releng/14.1 then it does *not* have the llvm change in it and thus it is crossbuilding without it, but against source that has it.

I wouldn't exactly expect that to matter but perhaps it does....

If I get inconsistent results I can probably set up a releng/14.1 machine and cross-build on that with the current stable/14 codebase to see if I get the same results.  If that indeed is the case (e.g. cross-building with releng/14.1 doesn't cause it, but cross-building with stable/14 does) then when this gets rolled forward on the official build machines (e.g. 14.2) I would expect it to become a generalized problem on the Pi3.
Comment 34 Mark Millard 2024-07-03 02:10:07 UTC
(In reply to Mark Millard from comment #30)

I should note about the pkgbase version in "mine" for:

yours: FreeBSD 14.1-STABLE stable/14-n268044-939f5a7b2bfb GENERIC arm64
mine:  FreeBSD 14.1-STABLE stable/14-n268059-536a452cc4e3 GENERIC arm64

that the world could be somewhat newer or older. The above for "mine"
is about what was involved in the kernel build. Absent changes to the
kernel, the world may have last changed before that or after that for
what the world was based on.

That is one of the reasons that I also produced and showed:

FreeBSD-.snap20240702082739

But I'm not sure how one converts snap20240702082739 for stable/14
into the exact set of source code used: it would just be the starting
point for figuring such out. Notably, source code that is used both
in kernel builds and in world builds, need not have used the same
source code for both in a pkg build, at least as far as I can tell.
The src and sys-src packages only have one of the 2 in such cases,
for example.
Comment 35 karl 2024-07-03 02:14:38 UTC
(In reply to Mark Millard from comment #34)
With my stable/14 codebase pulled forward to the current time (and the build machine having the clang change in) I still get the blow-ups in the same place and, as before, if I repeatedly execute it once in a while it does complete.

I'll go grab official images and try to run this down tomorrow to see if I can figure out what's the differentiating factor that produces different results.  May take a while, particularly if I have to stand up a build machine on releng/14.1
Comment 36 karl 2024-07-03 02:20:52 UTC
(In reply to karl from comment #35)
We do have the same clang and commit, along with the CPU parameters, showing up in the boot message:

Loading kernel...
/boot/kernel/kernel text=0x2a8 text=0x9dcba0 text=0x260b2c data=0x150eb8 data=0x0+0x2bc000 0x8+0x151848+0x8+0x17a846/
Loading configured modules...
can't find '/boot/entropy'
can't find '/etc/hostid'
Using DTB provided by EFI at 0x39f01000.
EFI framebuffer information:
addr, size     0x3eaf0000, 0x10a800
dimensions     656 x 416
stride         656
masks          0x00ff0000, 0x0000ff00, 0x000000ff, 0xff000000
---<<BOOT>>---
WARNING: Cannot find freebsd,dts-version property, cannot check DTB compliance
Copyright (c) 1992-2023 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 14.1-STABLE stable/14-n268064-1a0314d6e305 GENERIC arm64
FreeBSD clang version 18.1.6 (https://github.com/llvm/llvm-project.git llvmorg-18.1.6-0-g1118c2e05e67)
VT(efifb): resolution 656x416
module scmi already present!
real memory  = 994041856 (947 MB)
avail memory = 945451008 (901 MB)
Starting CPU 1 (1)
Starting CPU 2 (2)
Starting CPU 3 (3)
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
arc4random: WARNING: initial seeding bypassed the cryptographic random device because it was not yet seeded and the knob 'bypass_before_seeding' was enabled.
random: entropy device external interface

....
bcm2835_cpufreq0: ARM 600MHz, Core 250MHz, SDRAM 400MHz, Turbo OFF
CPU  0: ARM Cortex-A53 r0p4 affinity:  0
Comment 37 karl 2024-07-03 16:20:10 UTC
Well this is a problem with that image and getting enough installed to be able to do anything like, oh, pkgbase it forward....

root@generic:~ # uname -v
FreeBSD 14.1-RELEASE releng/14.1-n267679-10e31f0946d8 GENERIC freebsd
root@generic:/home # pkg install git
The package management tool is not yet installed on your system.
Do you want to fetch and install it now? [y/N]: y
Bootstrapping pkg from pkg+https://pkg.FreeBSD.org/FreeBSD:14:aarch64/quarterly, please wait...
Certificate verification failed for /CN=pkg.freebsd.org
0020616CE1680000:error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:/usr/src/crypto/openssl/ssl/statem/statem_clnt.c:1890:

.....

pkg: Error fetching https://pkg.FreeBSD.org/FreeBSD:14:aarch64/quarterly/Latest/pkg.txz: Authentication error
A pre-built version of pkg could not be found for your system.
Consider changing PACKAGESITE or installing it from ports: 'ports-mgmt/pkg'.

So it would appear that bootstrapping pkg is boned on releng/14.1 for the Pi at the moment.  I shall have to wait until a snapshot shows up I can grab or this is corrected.
Comment 38 Mark Millard 2024-07-03 17:32:09 UTC
(In reply to karl from comment #36)

I found that pkg+https:// ( so :443 instead of :80 ) was broken
by what pkg -d reported as:

SSL certificate problem: certificate is not yet valid

for its various retries.

I tested using pkg+http:// which worked just fine. (Not that you
would want that technique.)

I'm not sure yours is the same issue.

I've sent in a related note on the lists, quoting your note and
having some material about my problems and testing activity
yesterday.
Comment 39 Dimitry Andric freebsd_committer freebsd_triage 2024-07-03 17:51:05 UTC
Has anybody been able to get a backtrace in gdb for this crash? Preferably with debug symbols?
Comment 40 karl 2024-07-03 18:23:02 UTC
(In reply to Dimitry Andric from comment #39)
With symbols, unfortunately no -- I will have to rebuild with debug symbols and that almost-certainly requires I set up a second build file (since I intentionally size this one to be big enough to hold two operational partitions which can be hot-updated) -- that's not real difficult, but it means another build.

However, it was trivial to take the reproducer, start lldb on /usr/bin/cc and execute it, which did produce the crash and a backtrace -- but with no symbols which I know probably doesn't help much....

(lldb) run  "-cc1" "-triple" "aarch64-unknown-freebsd14.1" "-emit-obj" "-mrelax-all" "-disable-free" "-clear-ast-before-backend" "-disable-llvm-verifier" "-discard-value-names" "-main-file-name" "S0-encryption.c" "-mrelocation-model" "static" "-mframe-pointer=non-leaf" "-ffp-contract=on" "-fno-rounding-math" "-mconstructor-aliases" "-funwind-tables=2" "-target-cpu" "generic" "-target-feature" "+v8a" "-target-feature" "+fp-armv8" "-target-feature" "+neon" "-target-abi" "aapcs" "-debug-info-kind=standalone" "-dwarf-version=4" "-debugger-tuning=gdb" "-fdebug-compilation-dir=/data/HD-MCP" "-fcoverage-compilation-dir=/data/HD-MCP" "-D" "VERSION=\"8.0.0-LocalAuth\"" "-Wstrict-prototypes" "-ferror-limit" "19" "-fno-signed-char" "-fgnuc-version=4.2.1" "-fskip-odr-check-in-gmf" "-fcolor-diagnostics" "-faddrsig" "-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-x" "c" "S0-encryption-a63729.c"
Process 1452 launched: '/usr/bin/cc' (aarch64)
Process 1452 stopped
* thread #1, name = 'cc', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x5f34590)
    frame #0: 0x0000000004f6d008 cc`___lldb_unnamed_symbol121986 + 44
cc`___lldb_unnamed_symbol121986:
->  0x4f6d008 <+44>: ldr    x8, [x8, #0x590]
    0x4f6d00c <+48>: mov    x19, x7
    0x4f6d010 <+52>: mov    x23, x1
    0x4f6d014 <+56>: mov    x26, x4
(lldb) bt
* thread #1, name = 'cc', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x5f34590)
  * frame #0: 0x0000000004f6d008 cc`___lldb_unnamed_symbol121986 + 44
    frame #1: 0x0000000004f61be8 cc`___lldb_unnamed_symbol121947 + 516
    frame #2: 0x000000000507d930 cc`___lldb_unnamed_symbol124915 + 156
    frame #3: 0x00000000050727c0 cc`___lldb_unnamed_symbol124849 + 1072
    frame #4: 0x0000000005078af8 cc`___lldb_unnamed_symbol124868 + 1196
    frame #5: 0x0000000005071c84 cc`___lldb_unnamed_symbol124835 + 844
    frame #6: 0x000000000432ced8 cc`___lldb_unnamed_symbol92111 + 660
    frame #7: 0x000000000476416c cc`___lldb_unnamed_symbol102119 + 560
    frame #8: 0x0000000004769624 cc`___lldb_unnamed_symbol102163 + 60
    frame #9: 0x0000000004764708 cc`___lldb_unnamed_symbol102121 + 908
    frame #10: 0x00000000028497c8 cc`___lldb_unnamed_symbol22564 + 6600
    frame #11: 0x000000000285ba64 cc`___lldb_unnamed_symbol22987 + 1500
    frame #12: 0x000000000308411c cc`___lldb_unnamed_symbol45027 + 560
    frame #13: 0x0000000002a97470 cc`___lldb_unnamed_symbol26445 + 104
    frame #14: 0x00000000029f74e8 cc`___lldb_unnamed_symbol25701 + 752
    frame #15: 0x0000000002b438b0 cc`___lldb_unnamed_symbol28197 + 524
    frame #16: 0x0000000001fa022c cc`___lldb_unnamed_symbol584 + 1384
    frame #17: 0x0000000001fad090 cc`___lldb_unnamed_symbol641 + 984
    frame #18: 0x0000000001fac2c8 cc`___lldb_unnamed_symbol640 + 8544
    frame #19: 0x0000000001fa9e9c cc`___lldb_unnamed_symbol638 + 84
    frame #20: 0x000000008dafaf64 libc.so.7`__libc_start1 + 264
    frame #21: 0x0000000001f9fc14 cc`___lldb_unnamed_symbol581 + 28
(lldb)
Comment 41 karl 2024-07-03 20:05:19 UTC
Oh I don't like this one bit....

FreeBSD 14.1-STABLE stable/14-n268064-1a0314d6e305 GENERIC

SAME build (and same rev as above) EXCEPT in the config for the build I have this....

#FREEBSD_WORLD_EXTRA_ARGS="-DWITHOUT_DEBUG_FILES -DWITHOUT_KERNEL_SYMBOLS -DWITHOUT_TESTS"
FREEBSD_WORLD_EXTRA_ARGS="-DWITH_DEBUG_FILES -DWITHOUT_KERNEL_SYMBOLS -DWITHOUT_TESTS"

and
"option StripDebug" commented out, which removes the test and debugging directories from the object directory before the install.

This produces a much larger image of course (which I had to make room for), but allegedly is otherwise the same except for including the debug files, and the source revision has not changed as you can see.

But this DOES NOT crash: not on a clean make, nor does the reproducer blow up when copied in and run in a while loop for quite some time.

/lib/libc.so.7 (where it SEGVs on the stripped one) is NOT the same between the two builds:

The running one (that does not crash)
root@rpi:/lib # ls -al libc.so*
-r--r--r--  1 root wheel 1862984 Jul  3 19:21 libc.so.7
root@rpi:/lib #
root@rpi:/lib # file libc.so.7
libc.so.7: ELF 64-bit LSB shared object, ARM aarch64, version 1 (FreeBSD), dynamically linked, for FreeBSD 14.1 (1401501), stripped

And a mount of the image file without debugging and tests on my build system so I can look at it, which does crash:
root@NewFS:/mnt/lib # ls -al libc.so.7
-r--r--r--  1 root wheel 1862776 Jul  2 21:55 libc.so.7
root@NewFS:/mnt/lib # file libc.so.7
libc.so.7: ELF 64-bit LSB shared object, ARM aarch64, version 1 (FreeBSD), dynamically linked, for FreeBSD 14.1 (1401501), stripped

Both built on the exact same hardware and OS (I have changed nothing else between the two builds on the host doing them.)

Kernel checksums are different (same size) as well.

Now I'm scratching my head where I was trying to get a symbol-included traceback.....
Comment 42 Mark Millard 2024-07-03 23:24:36 UTC
(In reply to karl from comment #41)

Your description said, in part:

#0 0x0000000004b17588 (/usr/bin/cc+0x4b17588)
#1 0x0000000004b15650 (/usr/bin/cc+0x4b15650)
#2 0x0000000004ae16a0 (/usr/bin/cc+0x4ae16a0)
#3 0x000000008a02eeb8 (/lib/libthr.so.3+0x2aeb8)

Note that the SEGV happened for:

#0 0x0000000004b17588 (/usr/bin/cc+0x4b17588)

By contrast,

#1 called somewhere that later got to:
0x0000000004b17588 (/usr/bin/cc+0x4b17588)

#2 called somewhere that later got to:
0x0000000004b15650 (/usr/bin/cc+0x4b15650)

#3 called somewhere that later got to:
0x0000000004ae16a0 (/usr/bin/cc+0x4ae16a0)

The SEGV happened in /usr/bin/cc code instead of in /lib/libthr.so.3 .

There is no direct evidence here of /lib/libc.so.7 being involved.

I'll also note that when the compiler has internal checks enabled,
it aborts on the check failures via deliberately causing a SEGV as I
remember. When such happens, the SEGV of itself is not an indication
of the actual problem. It just can lead to a *.core file possibly
being generated.
Comment 43 karl 2024-07-04 14:05:01 UTC
(In reply to Mark Millard from comment #42)

Yes, but it appears that libthr just happened to be what called the code that blew up, and libthr was not the actual blowup.

The backtrace I captured showed it coming out of libc.so.7 somewhere but that too is not causal as that's the only frame with a library reference and is well-down the stack.

But here's the nasty on this -- you can't reproduce it with the official plus "rolled forward" pkgbase, and this now makes sense.  Last night I left a build running that had one line changed in the config file, since I was able to reproduce a build that didn't crash.  I whittled that down to this:

#
# This should be on normally, off if you want to be able to debug
#FREEBSD_WORLD_EXTRA_ARGS="-DWITHOUT_DEBUG_FILES -DWITHOUT_KERNEL_SYMBOLS -DWITHOUT_TESTS"
FREEBSD_WORLD_EXTRA_ARGS="-DWITH_DEBUG_FILES -DWITHOUT_KERNEL_SYMBOLS -DWITHOUT_TESTS"

With the buildworld being the top line (no debug files, symbols or tests) the resulting build crashes when compiling that source.  With the second line, changing *only* whether the debug files are *built* I cannot reproduce the crash.  I deliberately left stripping them before install on in this build so as to have exactly one variable between the two -- and the "with debug files" option does change the size and thus contents of libc.so.7; the file is different between those two as in my previous comment.

Incidentally /etc/src.conf on my build system, which dates its last change to 2015 (yeah, that volume has been around for a while ;-)) has "WITHOUT_DEBUG_FILES=yes" as its only line.

The default that appears in the generated code appears to be "KERNCONF=GENERIC" alone (that's what I find in the build-generated /etc/src.conf and I did not explicitly put anything there) and the stock RPI 14.1 img file that I grabbed, and which compiles the code cleanly, has no /etc/src.conf file at all.
Comment 44 Mark Millard 2024-07-04 16:43:58 UTC
(In reply to karl from comment #43)

FREEBSD_WORLD_EXTRA_ARGS is not involved in the standard FreeBSD
build environment (make files, scripts). It is Crochet specific.
I do not know what Crochet does generally.

I cannot tell for sure what status you are ending up with for (from
man src.conf output):

     WITHOUT_ASSERT_DEBUG
             Compile programs and libraries without the assert(3) checks.

     WITHOUT_LLVM_ASSERTIONS
             Disable debugging assertions in LLVM.

     WITHOUT_PTHREADS_ASSERTIONS
             Disable debugging assertions in pthreads library.

If any of the assertions are enabled in some type(s) of your builds,
there is the possibility that they are involved in why things stop.
(And stopping may well be valid in such a case.)

Official release and snapshot builds do not have the assertions
enabled as I understand. Official main [so 15 for now] builds do
have them enabled by default. (All 3 categories in each case.)
But that status is not automatic in personal builds: the default
is to have the assertions unless explicitly indicated otherwise.
(That is why "man src.conf" only lists the WITHOUT_ form,
indicating how to change away from the default.)

Crochet might sometimes leave the 3 at defaults, in which case
such builds would have assertions.

If such is the case, likely you are hitting one or more failed
assertions sometimes in your builds but the official builds do
not have the assertion at all.
Comment 45 karl 2024-07-04 17:22:12 UTC
(In reply to Mark Millard from comment #44)
Crochet doesn't do anything "odd"; it simply sets the target architecture and, if you added things in the "Extra" environment variables, it appends those:

freebsd_buildworld ( ) {
if [ -z ${WORLDJOBS} ]; then
        WORLDJOBS="-j $(sysctl -n hw.ncpu)"
else
        WORLDJOBS="-j${WORLDJOBS}"
fi
    _FREEBSD_WORLD_ARGS="TARGET_ARCH=${TARGET_ARCH} SRCCONF=${SRCCONF} __MAKE_CONF=${__MAKE_CONF} ${FREEBSD_EXTRA_ARGS} ${FREEBSD_WORLD_EXTRA_ARGS} ${FREEBSD_WORLD_BOARD_ARGS}"
    if [ -n "${TARGET_CPUTYPE}" ]; then
        _FREEBSD_WORLD_ARGS="TARGET_CPUTYPE=${TARGET_CPUTYPE} ${_FREEBSD_WORLD_ARGS}"
    fi
    CONF=${TARGET_ARCH}
    echo make ${_FREEBSD_WORLD_ARGS} ${FREEBSD_BUILDWORLD_EXTRA_ARGS} ${FREEBSD_BUILDWORLD_BOARD_ARGS} "$@" ${WORLDJOBS} buildworld > ${WORKDIR}/_.buildworld.${CONF}.sh
    if [ -n "${FREEBSD_FORCE_BUILDWORLD}" ]; then
        rm -f ${WORKDIR}/_.built-world.${CONF}
    fi
    _freebsd_build world ${CONF}
}

The config file for the Pi board (I set one up for both 3 and 4 after looking at what the official builds stuck in the EFI partition and how it differed from using "different for each") doesn't override anything interesting either -- all it does is set up the correct disk layout and deal with the EFI partition requirements.

I'd use nanobsd from the formal tools directory instead (I DO use it for AMD64 embedded stuff) but it's broken for the Pis and an inquiry on the list a good while back noted that it had time-rotted -- thus I've kept using what has worked pretty much forever.

If the default (absent an intentional override) is to include the asserts then they're on as I have definitely not overridden said default and a check of the Crochet build scripts doesn't include anything related to that at all -- the only "WITHOUTs" are in that "EXTRA" line.  I'm well-aware of additional asserts being enabled on -HEAD but that's not a factor here as I'm on -STABLE.

The evidence at this point appears to point to cross-compilation with the 3-line patch in on BOTH the building system and target's output, *and* with "-DWITHOUT_DEBUG_FILES" producing a compiler on the target that, on a Pi3 processor BUT NOT the 4, runs into some sort of alignment problem in some cases and thus blows up.  It's not 100% consistent: type "make" enough, or run the reproducer enough, and it will complete and produce a .o, and the produced code does run without incident that I can detect if the compile completes.  That implies the issue is non-deterministic behavior, which could implicate multiple threads running on different CPUs that vary in behavior due to other, unrelated things executing on the machine at the time.  Those can change the CPU affinity of the thread(s), the ordering and alignment of elements in memory that are accessed, and the order in which that all occurs.  If that's correct it makes for a nasty problem to run down to its root cause -- and raises the possibility that it will show up in other contexts, including some that could corrupt data or panic the system, without explanation or warning.

Change "-DWITHOUT_DEBUG_FILES" to "-DWITH_DEBUG_FILES" -- and nothing else -- and it doesn't blow up.  But interestingly enough that change, which only controls building and installing the standalone debug files *and thus should not change the basic shared libraries at all*, does result in at least /lib/libc.so.7 (and I presume others) being DIFFERENT than if the switch is set the other way.

Unless I'm misunderstanding the documentation in src.conf(7) -DWITHOUT_DEBUG_FILES should not implicate the actual running binaries and libraries (as opposed to the standalone debugging ones that this option builds if set the other way.)

I would not expect the "normal" (non-debugging) operational shared libraries in /lib to be different based on that switch at compilation -- but they are, and so is the behavior.
Comment 46 Mark Millard 2024-07-04 17:38:51 UTC
(In reply to Mark Millard from comment #44)

I got the release and stable default assertion status wrong
overall: LLVM for pkgbase stable/14 is different vs. for main.
I booted a (pkgbase) stable/14 system and "man src.conf" shows:

     WITHOUT_ASSERT_DEBUG
             Compile programs and libraries without the assert(3) checks.

     WITH_LLVM_ASSERTIONS
             Enable debugging assertions in LLVM.

     WITHOUT_PTHREADS_ASSERTIONS
             Disable debugging assertions in pthreads library.

so LLVM assertions are disabled by default in the standard
way of building the world on stable/14. The other 2 categories
are enabled by default. (Crochet might do its own thing for
some of this in some types of contexts for all I know.)

The system context was the pkgbase'd upgrade to:

# uname -apKU
FreeBSD generic 14.1-STABLE FreeBSD 14.1-STABLE stable/14-n268059-536a452cc4e3 GENERIC arm64 aarch64 1401501 1401501

I still am unsure just what your various builds have as the
status for the 3 categories of assertions. Variations in the
status for LLVM assertions might be a possible explanation.
Comment 47 karl 2024-07-04 17:46:13 UTC
(In reply to Mark Millard from comment #46)
Crochet is not setting anything beyond what the defaults are for the build machine, which in this case is stable/14.

FreeBSD 14.1-STABLE stable/14-n267965-1c27279ed22d KSD-SMP

(The custom kernel is there because I have the PPS HZ and hard enables in the config; the kernel is otherwise GENERIC)

And as noted /etc/src.conf on the build machine only has one line in it which I've had there for nearly 10 years -- WITHOUT_DEBUG_FILES=yes, which apparently is the differentiating factor and why I ran into this in the first place (as that is NOT the default.)

Crochet does not set anything explicitly other than what is in the "EXTRA" flags and, of course, whatever is in /etc/src.conf on the build system.  It sets the target architecture, of course as you'd expect for a cross-compile.  If I negate the /etc/src.conf WITHOUT_DEBUG_FILES setting via the Crochet config files, thus effectively being the default settings, then what I get does not crash.  I've had the reproducer script running in a "while" loop now for over an hour without any failures.
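
For putting a number on "does not crash" versus "usually crashes", a loop along these lines may help. It is only a sketch: count_failures is a name made up here, and ./reproducer.sh stands in for whatever the attached reproducer script is called once unpacked.

```sh
#!/bin/sh
# Run a command repeatedly and report how many runs exited non-zero.
# Usage: count_failures N command [args...]
count_failures() {
    runs=$1; shift
    fails=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Output is discarded; only the exit status is examined.
        if ! "$@" >/dev/null 2>&1; then
            fails=$((fails + 1))
        fi
        i=$((i + 1))
    done
    echo "$fails/$runs runs failed"
}

# Example (placeholder script name):
# count_failures 200 ./reproducer.sh
```

A fixed run count makes results from different world builds easier to compare than an open-ended "while" loop.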
Comment 48 Mark Millard 2024-07-04 18:07:58 UTC
(In reply to karl from comment #45)

Unfortunately WITHOUT_DEBUG_FILES vs. WITH_DEBUG_FILES changes
the memory layout details. For example, there is content in
the output indicating how to find the debug files for
WITH_ but not for WITHOUT_ (if I understand right).

This would mean that if there is code referencing memory
incorrectly, the results can be very different for the bad
references. The sometimes (but rarely) works vs. usually
fails suggests bad references as a possibility.
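
A rough way to check this on the installed library, assuming the "how to find the debug files" content is the usual .gnu_debuglink ELF section (an assumption, not verified against this tree). has_debuglink and describe_lib are names invented for the sketch.

```sh
#!/bin/sh
# Succeeds if the ELF file has a .gnu_debuglink section.
has_debuglink() {
    readelf -S "$1" 2>/dev/null | grep -q '\.gnu_debuglink'
}

# Print size, checksum and debuglink status for one library so the
# WITH_ and WITHOUT_ builds can be compared side by side.
describe_lib() {
    ls -l "$1"
    cksum "$1"
    if has_debuglink "$1"; then
        echo "$1: has .gnu_debuglink"
    else
        echo "$1: no .gnu_debuglink"
    fi
}

# Example:
# describe_lib /lib/libc.so.7
```

Running describe_lib against the library from each build would show whether the two differ only by that section or by more than that.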

Having assertions enabled vs. disabled differently would also
change the context for bad references and so might change
the behavior. But forcing the assertions to be enabled might
catch something that could lead to later bad references.

Maybe force all 3 of:

     WITH_ASSERT_DEBUG
     WITH_LLVM_ASSERTIONS
     WITH_PTHREADS_ASSERTIONS

for both cases:

     WITHOUT_DEBUG_FILES
vs.
     WITH_DEBUG_FILES

and see if the assertions frequently notice anything for
one or both cases?
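
The two builds suggested above could be generated like this. It is a sketch only: the option spellings are the ones listed in this comment, TARGET_ARCH=aarch64 is assumed for the target, and any other arguments your build normally passes would need adding.

```sh
#!/bin/sh
# All three assertion categories forced on, as listed above.
ASSERTS="-DWITH_ASSERT_DEBUG -DWITH_LLVM_ASSERTIONS -DWITH_PTHREADS_ASSERTIONS"

# Print one buildworld command line per debug-files setting.
build_matrix() {
    for dbg in -DWITHOUT_DEBUG_FILES -DWITH_DEBUG_FILES; do
        echo "make TARGET_ARCH=aarch64 $dbg $ASSERTS buildworld"
    done
}

build_matrix
```

The script only prints the command lines, so they can be reviewed (or pasted into the Crochet configuration) before an hour-long build is started.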
Comment 49 karl 2024-07-04 18:11:29 UTC
(In reply to Mark Millard from comment #48)
That's a very useful suggestion; I'll run a few tests with it since I now have a "single knob" I can twist that makes it either happen or not happen (at least with the assertions left "as default") and post back with whatever results I obtain.
Comment 50 Mark Millard 2024-07-04 19:00:10 UTC
(In reply to karl from comment #49)

Something else that might be useful is the likes of:

# ln -s 'junk:true' /etc/malloc.conf
or:
# ln -s 'junk:true,abort:true' /etc/malloc.conf

That last looks like the following via ls :

# ls -Tld /etc/malloc.conf
lrwxr-xr-x  1 root  wheel  20 Feb  4 03:47:13 2022 /etc/malloc.conf -> junk:true,abort:true

The "junk:true" is a means of having jemalloc fill allocated
memory with 0xa5 on allocation and 0x5a on deallocation (when
jemalloc is built to allow such).

I assume no use of WITH_MALLOC_PRODUCTION or such that might
disable the junk handling.

This is likely something that you would not want to leave in
place when you are doing things that are not intended for
such testing --especially if already time consuming.
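
If a system-wide symlink is too heavy-handed, jemalloc also reads the same option string from the MALLOC_CONF environment variable, so the junk filling can be limited to one command. A minimal sketch, where with_junk is a made-up helper name and foo.c is a placeholder:

```sh
#!/bin/sh
# Run a single command with junk filling requested through the
# environment instead of the /etc/malloc.conf symlink.
with_junk() {
    MALLOC_CONF='junk:true,abort:true' "$@"
}

# Example (placeholder source file):
# with_junk cc -c foo.c
```

The same caveat applies as for the symlink: the junk option only has an effect when jemalloc was built to allow it.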
Comment 51 karl 2024-07-05 16:39:10 UTC
Well crap.

Enabling assertions appears to change alignment enough that it doesn't blow up that way EITHER, *and* the resulting /lib/libc.so.7 is the same size as the one that DOES blow up. That implies that while libc might be where it blows up, the reason isn't that code itself; rather, something in it is being called with a bad pointer reference:

root@rpi:/data/HD-MCP # ls -al /lib/libc.so.7
-r--r--r--  1 root wheel 1862776 Jul  5 15:48 /lib/libc.so.7

So it appears that the inclusion of at least one of the assertion tests changes things enough that it does not crash.  This is the build command Crochet passes:

make TARGET_ARCH=aarch64 SRCCONF=/dev/null __MAKE_CONF=/dev/null -DWITHOUT_CLEAN -DWITHOUT_DEBUG_FILES -DWITHOUT_KERNEL_SYMBOLS -DWITHOUT_TESTS -DWITH_ASSERT_DEBUG -DWITH_LLVM_ASSERTIONS -DWITH_PTHREAD_ASSERTIONS -j 12 buildworld

No failures; difference between "crashes" and "not crashes" is "-DWITH_ASSERT_DEBUG -DWITH_LLVM_ASSERTIONS -DWITH_PTHREAD_ASSERTIONS"
Comment 52 karl 2024-07-05 18:32:59 UTC
Ok, just to confirm I shut those three options back off and now I get the crashes again.

Only difference (same source tree, object directory cleared each build, etc.) in the world build is this:

make TARGET_ARCH=aarch64 SRCCONF=/dev/null __MAKE_CONF=/dev/null -DWITHOUT_CLEAN -DWITHOUT_DEBUG_FILES -DWITHOUT_KERNEL_SYMBOLS -DWITHOUT_TESTS -j 12 buildworld

Garf; this clearly makes reproduction with enough context to determine the reason difficult....
Comment 53 Mark Millard 2024-07-05 18:45:29 UTC
(In reply to karl from comment #52)

Using -DWITHOUT_CLEAN instead of META_MODE or from scratch builds
is asking for problems with things not being rebuilt that should
be.

You might want to try start from scratch forms of build. (Such
would be appropriate for starting META_MODE use for rebuilds as
well.)

If you have not done a from scratch form of build for your normally
intended options, you might want to see if such avoids the problems
in what is produced.
Comment 54 karl 2024-07-05 18:48:17 UTC
(In reply to Mark Millard from comment #53)
To eliminate any possibility of this (which I am quite-aware of from partial attempts burning me with unpredictable results in the past) the object directory has been subject to an "rm -rf /object-directory/*" before EACH build.  The cost of this of course is roughly 1 hour for each build run since all start from an empty object directory.
Comment 55 karl 2024-07-08 01:42:11 UTC
(In reply to karl from comment #54)
-DWITH_ASSERT_DEBUG (but not the other two) does crash - but produces no backtrace or debugging messages I can discern, thus isn't of help.

-DWITH_LLVM_ASSERTIONS does not crash (alone)

(With all three does not crash either as noted above)

Will run with -DWITH_PTHREAD_ASSERTIONS alone as well tomorrow just to check the third one individually.
Comment 56 Mark Millard 2024-07-08 04:16:36 UTC
(In reply to karl from comment #55)

For a src.conf context I'd recommend:

# Avoid stripping but do not control host -g status as well:
DEBUG_FLAGS+=

which just makes sure DEBUG_FLAGS is defined without changing its
defined content (if any).

It is part of how I do my personal system builds.
Comment 57 karl 2024-07-08 11:36:11 UTC
This does crash -- but produces no assertion:

FREEBSD_WORLD_EXTRA_ARGS="-DWITHOUT_DEBUG_FILES -DWITHOUT_KERNEL_SYMBOLS -DWITHOUT_TESTS -DWITH_PTHREAD_ASSERTIONS"

Including -DWITH_LLVM_ASSERTIONS causes it to not crash, as does building with the debug files (even though they are not in use, that *does* change the size and contents of system libraries, including /lib/libc.so.7) -- the other two apparently do not impact the code path that is involved in the crashes.