Bug 290958 - ctfmerge: random Segmentation fault: 11 for `make buildkernel' on macOS
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 16.0-CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Mark Peek
URL:
Keywords: crash
Depends on:
Blocks: 287194 291186
Reported: 2025-11-11 19:11 UTC by Wolfram Schneider
Modified: 2025-12-03 17:46 UTC


Attachments
Patch for missing locking around ctfmerge fifo operations (1.25 KB, patch)
2025-11-24 00:19 UTC, Mark Peek

Description Wolfram Schneider 2025-11-11 19:11:25 UTC
I'm running `make buildkernel' on macOS (cross compiling) and sometimes ctfmerge crashes with a Segmentation fault: 11. The crashes are random and do not happen on the same ctfmerge call each time.

libsystem_pthread.dylib
Exception Type: EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x00004bbc62118008
Exception Codes: 0x0000000000000001, 0x00004bbc62118008
Termination Reason: Namespace SIGNAL, Code 11 Segmentation fault: 11

This seems to be a problem with the macOS pthread library, or with how ctfmerge uses it. It happens on both ARM and Intel, on the latest macOS 26.

I ran `make cleankernel; make buildkernel' in a loop 100 times (about 2 hours), and ctfmerge usually crashes in 3 to 5 of those runs.


Example:

for i in $(seq 1 100); do printf "."; env logrotate_max=100 debug=0 build_targets='cleankernel;buildkernel' cross-build-macos.sh >/dev/null 2>&1 || echo " i=$i exit=$?";done

.......... i=10 exit=2
. i=11 exit=2
.............................................. i=57 exit=2
....... i=64 exit=2
............................ i=92 exit=2
........

sh: line 2: 64617 Segmentation fault: 11 ctfmerge -L VERSION -g -o ipl.kld mlfk_ipl.o ip_nat.o ip_frag.o ip_state.o ip_proxy.o ip_auth.o ip_log.o ip_fil_freebsd.o fil.o ip_lookup.o ip_pool.o ip_htable.o ip_sync.o ip_nat6.o ip_rules.o ip_scan.o ip_dstlist.o radix_ipf.o
*** [ipl.kld] Error code 139

zegrep -l 'Segmentation fault' macos.buildkernel*gz
macos.buildkernel.35.gz
macos.buildkernel.42.gz
macos.buildkernel.7.gz
macos.buildkernel.88.gz
macos.buildkernel.89.gz
Comment 1 Mark Johnston 2025-11-14 17:16:52 UTC
Are you able to get a backtrace?

Is the crash reproducible if you run ctfmerge with CTFMERGE_MAX_SLOTS=1?  I think setting that will force it to use a single thread.
Comment 2 Mark Peek 2025-11-24 00:19:56 UTC
Created attachment 265610 [details]
Patch for missing locking around ctfmerge fifo operations

I was able to reproduce this issue by running the build in a loop, and then simplified the reproduction by running just the ctfmerge command from the last crash in a loop. That fails fairly quickly, well within 100 iterations.

(lldb) bt all
  thread #1
    frame #0: 0x00000001978ca4f8 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019790a0dc libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame #2: 0x0000000104eefca0 ctfmerge`main + 1736
    frame #3: 0x0000000197541d54 dyld`start + 7184
  thread #2
    frame #0: 0x00000001978c99c8 libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x0000000197906e3c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x0000000197904868 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 220
    frame #3: 0x0000000104ef05dc ctfmerge`worker_thread + 980
    frame #4: 0x0000000197909c08 libsystem_pthread.dylib`_pthread_start + 136
* thread #3, stop reason = ESR_EC_DABORT_EL0 (fault address: 0x17f5)
  * frame #0: 0x0000000104ef093c ctfmerge`fifo_len + 16
    frame #1: 0x0000000104ef06d4 ctfmerge`worker_thread + 1228
    frame #2: 0x0000000197909c08 libsystem_pthread.dylib`_pthread_start + 136
  thread #4
    frame #0: 0x00000001978ca4f8 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019790a0dc libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame #2: 0x0000000104ef06e8 ctfmerge`worker_thread + 1248
    frame #3: 0x0000000197909c08 libsystem_pthread.dylib`_pthread_start + 136

I fixed the above occurrence by locking around the fifo_len() call, and then got this crash at another fifo_len() call site:

(lldb) bt all
  thread #1
    frame #0: 0x00000001978ca4f8 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019790a0dc libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame #2: 0x0000000102317ca0 ctfmerge`main(argc=<unavailable>, argv=<unavailable>) at ctfmerge.c:928:3 [opt]
    frame #3: 0x0000000197541d54 dyld`start + 7184
  thread #2
    frame #0: 0x00000001978c99c8 libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x0000000197906e3c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x0000000197904868 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 220
    frame #3: 0x000000019790a168 libsystem_pthread.dylib`_pthread_cond_wait + 1124
    frame #4: 0x00000001023186f8 ctfmerge`worker_runphase2(wq=0x0000000102344968) at ctfmerge.c:472:4 [opt] [inlined]
    frame #5: 0x0000000102318624 ctfmerge`worker_thread(wq=0x0000000102344968) at ctfmerge.c:544:2 [opt]
    frame #6: 0x0000000197909c08 libsystem_pthread.dylib`_pthread_start + 136
  thread #3
    frame #0: 0x00000001978c99c8 libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x0000000197906e3c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x0000000197904868 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 220
    frame #3: 0x000000019790a168 libsystem_pthread.dylib`_pthread_cond_wait + 1124
    frame #4: 0x00000001023186f8 ctfmerge`worker_runphase2(wq=0x0000000102344968) at ctfmerge.c:472:4 [opt] [inlined]
    frame #5: 0x0000000102318624 ctfmerge`worker_thread(wq=0x0000000102344968) at ctfmerge.c:544:2 [opt]
    frame #6: 0x0000000197909c08 libsystem_pthread.dylib`_pthread_start + 136
* thread #4, stop reason = ESR_EC_DABORT_EL0 (fault address: 0x2176)
  * frame #0: 0x000000010231894c ctfmerge`fifo_len + 16
    frame #1: 0x00000001023186e4 ctfmerge`worker_runphase2(wq=0x0000000102344968) at ctfmerge.c:471:7 [opt] [inlined]
    frame #2: 0x0000000102318624 ctfmerge`worker_thread(wq=0x0000000102344968) at ctfmerge.c:544:2 [opt]
    frame #3: 0x0000000197909c08 libsystem_pthread.dylib`_pthread_start + 136
  thread #5
    frame #0: 0x00000001978c99c8 libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x0000000197906e3c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x0000000197904868 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 220
    frame #3: 0x0000000102318578 ctfmerge`worker_thread(wq=0x0000000102344968) at ctfmerge.c:532:3 [opt]
    frame #4: 0x0000000197909c08 libsystem_pthread.dylib`_pthread_start + 136
  thread #6
    frame #0: 0x00000001978ca4f8 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019790a0dc libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame #2: 0x00000001023186f8 ctfmerge`worker_runphase2(wq=0x0000000102344968) at ctfmerge.c:472:4 [opt] [inlined]
    frame #3: 0x0000000102318624 ctfmerge`worker_thread(wq=0x0000000102344968) at ctfmerge.c:544:2 [opt]
    frame #4: 0x0000000197909c08 libsystem_pthread.dylib`_pthread_start + 136

I fixed the second one and then found another by reviewing all the fifo_*() calls; the attached patch contains these fixes. I ran the loop twice, for 10000 iterations each, without an issue.
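
The locking pattern described above can be sketched as follows. This is a minimal illustration, not the attached patch: queue_t, qnode_t, queue_len() and queue_len_unlocked() are hypothetical names for a simplified list, not the fifo_t from ctfmerge. Later comments in this bug trace the crash to the barrier rather than to the fifo itself, so read this only as an example of guarding a list walk with a mutex.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, simplified queue; not the fifo_t used by ctfmerge. */
typedef struct qnode {
	struct qnode *next;
} qnode_t;

typedef struct queue {
	qnode_t *head;
	pthread_mutex_t lock;	/* protects head and every node's next pointer */
} queue_t;

/* Walks the list without locking; safe only if no other thread modifies it. */
static int
queue_len_unlocked(const queue_t *q)
{
	int n = 0;

	for (const qnode_t *p = q->head; p != NULL; p = p->next)
		n++;
	return (n);
}

/* Holds the lock for the whole walk so no node can be unlinked mid-walk. */
int
queue_len(queue_t *q)
{
	int n;

	pthread_mutex_lock(&q->lock);
	n = queue_len_unlocked(q);
	pthread_mutex_unlock(&q->lock);
	return (n);
}
```

The lock only helps if every writer that links or unlinks nodes takes the same mutex.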


Note: to get a core dump and an lldb backtrace on macOS:
1. Make /cores writable by the user: "chmod 777 /cores"
2. Set the core limit: "ulimit -c unlimited"
3. Codesign the ctfmerge binary to give it a core dump entitlement:
   /usr/libexec/PlistBuddy -c "Add :com.apple.security.get-task-allow bool true" tmp.entitlements
   codesign -s - -f --entitlements tmp.entitlements /path/to/ctfmerge

Then run lldb:
    lldb -c /cores/core.<pid> -f /path/to/ctfmerge
    (lldb) bt all
Comment 3 Wolfram Schneider 2025-11-24 07:40:17 UTC
(In reply to Mark Johnston from comment #1)

I get no crashes with CTFMERGE_MAX_SLOTS=1, tried it 300 times. Thanks for the hint!
Comment 4 Wolfram Schneider 2025-11-24 15:21:02 UTC
(In reply to Mark Peek from comment #2)

The patch works great, no more segfaults!
Comment 5 Mark Johnston 2025-11-24 15:55:15 UTC
(In reply to Mark Peek from comment #2)
I'm confused by this patch.  Each worker thread goes through two barriers; the last thread to enter the first barrier executes finalize_phase_one() and init_phase_two(), which are the places you locked.  Other threads should be blocked in wq_bar2 while that one thread is executing.  That is, nothing should be touching the fifo.  What am I missing?
Comment 6 Mark Peek 2025-11-24 22:58:06 UTC
(In reply to Mark Johnston from comment #5)
Yes, you are correct about the barriers. I dug into it a bit more, and the threads are not actually waiting in barrier_wait(). The sem_*() functions are deprecated on macOS, and barrier.c does not check their return values.

I added debug output and got this errno back from sem_init():
barrier_init: sem_init failed on barrier 0x104a18af8: Function not implemented
barrier_init: sem_init failed on barrier 0x104a18b48: Function not implemented

and then, later on, sem_wait() returns immediately:
DEBUG: barrier_wait: thread 6097268736 sem_wait failed on barrier 0x104a18af8: Bad file descriptor

Options:
1. Force CTFMERGE_MAX_SLOTS=1 on macOS
2. Implement new sem_*() operations for macOS
   (found this: https://stackoverflow.com/questions/27736618/why-are-sem-init-sem-getvalue-sem-destroy-deprecated-on-mac-os-x-and-w)
3. ???
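
The missing error check described in this comment can be illustrated with a small sketch. It is not the debug code that was added to barrier.c: checked_sem_init() is a hypothetical helper. The only behavior it relies on is what the debug output above reports, namely that sem_init() on macOS fails with "Function not implemented" (ENOSYS).

```c
#include <errno.h>
#include <semaphore.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: initialize an unnamed, process-private semaphore and
 * report failure instead of ignoring it.  Returns 0 on success, otherwise the
 * errno value set by sem_init().
 */
int
checked_sem_init(sem_t *sem, unsigned int value)
{
	if (sem_init(sem, 0, value) == -1) {
		int err = errno;

		fprintf(stderr, "sem_init failed: %s\n", strerror(err));
		return (err);
	}
	return (0);
}
```

A caller that sees a nonzero return can stop, or fall back to another primitive, instead of continuing with a semaphore that was never initialized.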
Comment 7 Mark Johnston 2025-11-24 23:00:30 UTC
(In reply to Mark Peek from comment #6)
Does macOS have pthread_barrier?  That is, pthread_barrier_init(), etc.  If so, I think the best thing would be to just modify ctfmerge to use them and get rid of barrier.c.
Comment 8 Mark Peek 2025-11-24 23:19:19 UTC
(In reply to Mark Johnston from comment #7)
I do not see pthread_barrier_*() implemented on macOS. We could implement those instead of going the sem_*() route. The question is whether any other build utilities need one or the other, so that a new implementation could be reused.
Comment 9 Mark Johnston 2025-11-25 13:54:08 UTC
(In reply to Mark Peek from comment #8)
I can't see any other places where a new barrier implementation might be useful.

Yet another idea: use a condition variable instead of a semaphore.  That makes more sense anyway, and macOS surely provides those.
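
A barrier built on a condition variable might look roughly like the sketch below. This is an assumption-laden illustration, not the change that was committed to barrier.c: cv_barrier_t, its fields, and all cv_barrier_*() functions are hypothetical names. It keeps the convention from comment #5 that the last thread to arrive is told so by the return value, and it uses a generation counter so that spurious wakeups do not release a waiter early.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical barrier type; names do not come from barrier.h. */
typedef struct cv_barrier {
	pthread_mutex_t lock;
	pthread_cond_t cv;
	int nthreads;		/* threads that must arrive */
	int arrived;		/* threads that have arrived in this round */
	unsigned generation;	/* bumped each time the barrier opens */
} cv_barrier_t;

int
cv_barrier_init(cv_barrier_t *b, int nthreads)
{
	b->nthreads = nthreads;
	b->arrived = 0;
	b->generation = 0;
	if (pthread_mutex_init(&b->lock, NULL) != 0)
		return (-1);
	if (pthread_cond_init(&b->cv, NULL) != 0) {
		pthread_mutex_destroy(&b->lock);
		return (-1);
	}
	return (0);
}

void
cv_barrier_destroy(cv_barrier_t *b)
{
	pthread_cond_destroy(&b->cv);
	pthread_mutex_destroy(&b->lock);
}

/* Returns 1 to the last thread to arrive, 0 to every other thread. */
int
cv_barrier_wait(cv_barrier_t *b)
{
	int last = 0;

	pthread_mutex_lock(&b->lock);
	if (++b->arrived == b->nthreads) {
		b->arrived = 0;		/* reset for the next round */
		b->generation++;
		pthread_cond_broadcast(&b->cv);
		last = 1;
	} else {
		unsigned gen = b->generation;

		/* Loop guards against spurious wakeups. */
		while (gen == b->generation)
			pthread_cond_wait(&b->cv, &b->lock);
	}
	pthread_mutex_unlock(&b->lock);
	return (last);
}

struct demo_arg {
	cv_barrier_t *bar;
	int was_last;
};

static void *
demo_worker(void *p)
{
	struct demo_arg *a = p;

	a->was_last = cv_barrier_wait(a->bar);
	return (NULL);
}

/* Runs nthreads (1..16) through one barrier; returns how many were "last". */
int
cv_barrier_demo(int nthreads)
{
	pthread_t tids[16];
	struct demo_arg args[16];
	cv_barrier_t bar;
	int i, last = 0;

	if (nthreads < 1 || nthreads > 16 || cv_barrier_init(&bar, nthreads) != 0)
		return (-1);
	for (i = 0; i < nthreads; i++) {
		args[i].bar = &bar;
		args[i].was_last = 0;
		if (pthread_create(&tids[i], NULL, demo_worker, &args[i]) != 0)
			abort();	/* peers would block forever otherwise */
	}
	for (i = 0; i < nthreads; i++) {
		pthread_join(tids[i], NULL);
		last += args[i].was_last;
	}
	cv_barrier_destroy(&bar);
	return (last);
}
```

In this sketch the last thread releases the others before it returns. A caller that needs the other threads to stay blocked while the last one does extra work would need a second barrier after that work, as described in comment #5.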
Comment 10 commit-hook 2025-12-01 20:52:58 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=732b4aa05d78ca6831d02e67a43f34ad104f4f01

commit 732b4aa05d78ca6831d02e67a43f34ad104f4f01
Author:     Mark Peek <mp@FreeBSD.org>
AuthorDate: 2025-12-01 20:50:24 +0000
Commit:     Mark Peek <mp@FreeBSD.org>
CommitDate: 2025-12-01 20:50:24 +0000

    ctfmerge: fix segfault when building on macOS

    The barrier code was using semaphores which have been deprecated in
    macOS and not working at all, causing a race condition. Since macOS
    does not have pthread_barrier_*(), this change uses a condition
    variable instead.

    PR:             290958
    Reported by:    wosch
    MFC after:      2 weeks
    Reviewed by:    imp, markj
    Differential Revision:  https://reviews.freebsd.org/D54018

 cddl/contrib/opensolaris/tools/ctf/cvt/barrier.c | 26 +++---------------------
 cddl/contrib/opensolaris/tools/ctf/cvt/barrier.h |  9 ++------
 2 files changed, 5 insertions(+), 30 deletions(-)
Comment 11 Wolfram Schneider 2025-12-02 09:15:22 UTC
No segfaults anymore after the fix. Thanks.