I'm running `make buildkernel' on macOS (cross compiling) and sometimes ctfmerge crashes with a Segmentation fault: 11. The errors are random, and not for the same call.

libsystem_pthread.dylib
Exception Type: EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x00004bbc62118008
Exception Codes: 0x0000000000000001, 0x00004bbc62118008
Termination Reason: Namespace SIGNAL, Code 11 Segmentation fault: 11

This seems to be a problem with the macOS pthread library, or the use of the lib. It happens on arm and intel, and the latest macOS 26.

I run `make cleankernel; make buildkernel' in a loop 100 times (ca. 2 hours) and usually in 3-5 cases ctfmerge will crash.

Example:

for i in $(seq 1 100); do printf "."; env logrotate_max=100 debug=0 build_targets='cleankernel;buildkernel' cross-build-macos.sh >/dev/null 2>&1 || echo " i=$i exit=$?";done
.......... i=10 exit=2
. i=11 exit=2
.............................................. i=57 exit=2
....... i=64 exit=2
............................ i=92 exit=2
........

sh: line 2: 64617 Segmentation fault: 11 ctfmerge -L VERSION -g -o ipl.kld mlfk_ipl.o ip_nat.o ip_frag.o ip_state.o ip_proxy.o ip_auth.o ip_log.o ip_fil_freebsd.o fil.o ip_lookup.o ip_pool.o ip_htable.o ip_sync.o ip_nat6.o ip_rules.o ip_scan.o ip_dstlist.o radix_ipf.o
*** [ipl.kld] Error code 139

zegrep -l 'Segmentation fault' macos.buildkernel*gz
macos.buildkernel.35.gz
macos.buildkernel.42.gz
macos.buildkernel.7.gz
macos.buildkernel.88.gz
macos.buildkernel.89.gz
Are you able to get a backtrace? Is the crash reproducible if you run ctfmerge with CTFMERGE_MAX_SLOTS=1? I think setting that will force it to use a single thread.
Created attachment 265610
Patch for missing locking around ctfmerge fifo operations

I was able to reproduce this issue by running the build in a loop, and then simplified it by just running the ctfmerge command from the last crash in a loop. This would fail fairly quickly in a loop to 100.

(lldb) bt all
  thread #1
    frame #0: 0x00000001978ca4f8 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019790a0dc libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame #2: 0x0000000104eefca0 ctfmerge`main + 1736
    frame #3: 0x0000000197541d54 dyld`start + 7184
  thread #2
    frame #0: 0x00000001978c99c8 libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x0000000197906e3c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x0000000197904868 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 220
    frame #3: 0x0000000104ef05dc ctfmerge`worker_thread + 980
    frame #4: 0x0000000197909c08 libsystem_pthread.dylib`_pthread_start + 136
* thread #3, stop reason = ESR_EC_DABORT_EL0 (fault address: 0x17f5)
  * frame #0: 0x0000000104ef093c ctfmerge`fifo_len + 16
    frame #1: 0x0000000104ef06d4 ctfmerge`worker_thread + 1228
    frame #2: 0x0000000197909c08 libsystem_pthread.dylib`_pthread_start + 136
  thread #4
    frame #0: 0x00000001978ca4f8 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019790a0dc libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame #2: 0x0000000104ef06e8 ctfmerge`worker_thread + 1248
    frame #3: 0x0000000197909c08 libsystem_pthread.dylib`_pthread_start + 136

I fixed the above occurrence by locking around the fifo_len() call, and then hit the same fault at another fifo_len() call:

(lldb) bt all
  thread #1
    frame #0: 0x00000001978ca4f8 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019790a0dc libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame #2: 0x0000000102317ca0 ctfmerge`main(argc=<unavailable>, argv=<unavailable>) at ctfmerge.c:928:3 [opt]
    frame #3: 0x0000000197541d54 dyld`start + 7184
  thread #2
    frame #0: 0x00000001978c99c8 libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x0000000197906e3c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x0000000197904868 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 220
    frame #3: 0x000000019790a168 libsystem_pthread.dylib`_pthread_cond_wait + 1124
    frame #4: 0x00000001023186f8 ctfmerge`worker_runphase2(wq=0x0000000102344968) at ctfmerge.c:472:4 [opt] [inlined]
    frame #5: 0x0000000102318624 ctfmerge`worker_thread(wq=0x0000000102344968) at ctfmerge.c:544:2 [opt]
    frame #6: 0x0000000197909c08 libsystem_pthread.dylib`_pthread_start + 136
  thread #3
    frame #0: 0x00000001978c99c8 libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x0000000197906e3c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x0000000197904868 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 220
    frame #3: 0x000000019790a168 libsystem_pthread.dylib`_pthread_cond_wait + 1124
    frame #4: 0x00000001023186f8 ctfmerge`worker_runphase2(wq=0x0000000102344968) at ctfmerge.c:472:4 [opt] [inlined]
    frame #5: 0x0000000102318624 ctfmerge`worker_thread(wq=0x0000000102344968) at ctfmerge.c:544:2 [opt]
    frame #6: 0x0000000197909c08 libsystem_pthread.dylib`_pthread_start + 136
* thread #4, stop reason = ESR_EC_DABORT_EL0 (fault address: 0x2176)
  * frame #0: 0x000000010231894c ctfmerge`fifo_len + 16
    frame #1: 0x00000001023186e4 ctfmerge`worker_runphase2(wq=0x0000000102344968) at ctfmerge.c:471:7 [opt] [inlined]
    frame #2: 0x0000000102318624 ctfmerge`worker_thread(wq=0x0000000102344968) at ctfmerge.c:544:2 [opt]
    frame #3: 0x0000000197909c08 libsystem_pthread.dylib`_pthread_start + 136
  thread #5
    frame #0: 0x00000001978c99c8 libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x0000000197906e3c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x0000000197904868 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 220
    frame #3: 0x0000000102318578 ctfmerge`worker_thread(wq=0x0000000102344968) at ctfmerge.c:532:3 [opt]
    frame #4: 0x0000000197909c08 libsystem_pthread.dylib`_pthread_start + 136
  thread #6
    frame #0: 0x00000001978ca4f8 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019790a0dc libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame #2: 0x00000001023186f8 ctfmerge`worker_runphase2(wq=0x0000000102344968) at ctfmerge.c:472:4 [opt] [inlined]
    frame #3: 0x0000000102318624 ctfmerge`worker_thread(wq=0x0000000102344968) at ctfmerge.c:544:2 [opt]
    frame #4: 0x0000000197909c08 libsystem_pthread.dylib`_pthread_start + 136

I fixed the second one and then found another by reviewing all the fifo_*() calls for the attached patch. I ran this twice in a loop to 10000 without an issue.

Note: to get a core dump and an lldb backtrace on macOS:

1. Change /cores to be writable by the user: "chmod 777 /cores"
2. Set the core limit: "ulimit -c unlimited"
3. codesign the ctfmerge binary to give it a core dump entitlement:
   /usr/libexec/PlistBuddy -c "Add :com.apple.security.get-task-allow bool true" tmp.entitlements
   codesign -s - -f --entitlements tmp.entitlements /path/to/ctfmerge

Then run lldb:

lldb -c /cores/core.<pid> -f /path/to/ctfmerge
(lldb) bt all
(In reply to Mark Johnston from comment #1) I get no crashes with CTFMERGE_MAX_SLOTS=1, tried it 300 times. Thanks for the hint!
(In reply to Mark Peek from comment #2) the patch works great, no segfault anymore!
(In reply to Mark Peek from comment #2) I'm confused by this patch. Each worker thread goes through two barriers; the last thread to enter the first barrier executes finalize_phase_one() and init_phase_two(), which are the places you locked. Other threads should be blocked in wq_bar2 while that one thread is executing. That is, nothing should be touching the fifo. What am I missing?
(In reply to Mark Johnston from comment #5)
Yes, you are correct about the barriers. I dug into it a bit more and the threads are not waiting in barrier_wait(). This is because the sema_*() functions are deprecated on macOS and barrier.c does no error checking on the sema_*() return values. I added debug output and got this back from errno on sem_init():

barrier_init: sem_init failed on barrier 0x104a18af8: Function not implemented
barrier_init: sem_init failed on barrier 0x104a18b48: Function not implemented

and then later on, sema_wait() returns immediately:

DEBUG: barrier_wait: thread 6097268736 sem_wait failed on barrier 0x104a18af8: Bad file descriptor

Options:
1. Force CTFMERGE_MAX_SLOTS=1 on macOS
2. Implement new sema_*() operations for macOS (found this: https://stackoverflow.com/questions/27736618/why-are-sem-init-sem-getvalue-sem-destroy-deprecated-on-mac-os-x-and-w)
3. ???
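
For illustration, the missing error check could look roughly like the sketch below. This is not the code in barrier.c; checked_sem_init() is a hypothetical helper name. It only shows that sem_init() reports failure through its return value and errno, which is how the "Function not implemented" (ENOSYS) message above was obtained. On macOS the compiler will likely also warn that sem_init() is deprecated.

```c
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of barrier.c: initialize an unnamed
 * semaphore and report failure instead of silently ignoring it.
 * Returns 0 on success, or the errno value from sem_init() on failure.
 */
int
checked_sem_init(sem_t *sem, unsigned int value)
{
	assert(sem != NULL);

	/* pshared = 0: the semaphore is shared between threads only. */
	if (sem_init(sem, 0, value) == -1) {
		int err = errno;	/* save before fprintf can clobber it */

		fprintf(stderr, "sem_init failed on %p: %s\n",
		    (void *)sem, strerror(err));
		return (err);
	}
	return (0);
}
```

A caller that gets a nonzero return here knows the semaphore is unusable and can bail out or fall back to a single thread, rather than continuing into sem_wait() calls that return immediately.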
(In reply to Mark Peek from comment #6) Does macOS have pthread_barrier? That is, pthread_barrier_init(), etc. If so, I think the best thing would be to just modify ctfmerge to use them and get rid of barrier.c.
(In reply to Mark Johnston from comment #7) I do not see pthread_barrier_*() implemented on macOS. Implementing those could be done instead of the sema_*() route. The question is whether any other build utilities need one or the other, so that a new implementation could be reused.
(In reply to Mark Peek from comment #8) I can't see any other places where a new barrier implementation might be useful. Yet another idea: use a condition variable instead of a semaphore. That makes more sense anyway, and macOS surely provides those.
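
A minimal sketch of the condition-variable idea, for reference. This is not the committed change; the cv_barrier_* names, the generation counter and the demo harness are all assumptions made for illustration. It keeps the semantics described in comment #5: the last thread to arrive gets a return value of 1 and runs the serial section while the others stay blocked until it calls the release function.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical barrier type; not the structure used in barrier.c. */
typedef struct cv_barrier {
	pthread_mutex_t	lock;
	pthread_cond_t	cv;
	int		nthreads;	/* threads that must arrive */
	int		arrived;	/* threads arrived in this round */
	unsigned	generation;	/* bumped on every release */
} cv_barrier_t;

void
cv_barrier_init(cv_barrier_t *b, int nthreads)
{
	assert(nthreads > 0);
	pthread_mutex_init(&b->lock, NULL);
	pthread_cond_init(&b->cv, NULL);
	b->nthreads = nthreads;
	b->arrived = 0;
	b->generation = 0;
}

/* Returns 1 to the last thread to arrive, 0 to all others. */
int
cv_barrier_wait(cv_barrier_t *b)
{
	pthread_mutex_lock(&b->lock);
	if (++b->arrived < b->nthreads) {
		unsigned gen = b->generation;

		/* Loop guards against spurious wakeups. */
		while (gen == b->generation)
			pthread_cond_wait(&b->cv, &b->lock);
		pthread_mutex_unlock(&b->lock);
		return (0);
	}
	/* Last one in: everyone else is blocked above. */
	pthread_mutex_unlock(&b->lock);
	return (1);
}

/* Called by the last thread once its serial work is done. */
void
cv_barrier_release(cv_barrier_t *b)
{
	pthread_mutex_lock(&b->lock);
	b->arrived = 0;
	b->generation++;
	pthread_cond_broadcast(&b->cv);
	pthread_mutex_unlock(&b->lock);
}

/* Demo harness (illustrative only). */
#define	DEMO_THREADS	4
static cv_barrier_t demo_bar;
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int serial_runs;		/* times the serial section ran */
static int saw_serial;		/* waiters that saw it completed */

static void *
demo_worker(void *arg)
{
	(void)arg;
	if (cv_barrier_wait(&demo_bar)) {
		serial_runs++;	/* safe: all other threads are blocked */
		cv_barrier_release(&demo_bar);
	} else {
		pthread_mutex_lock(&demo_lock);
		if (serial_runs == 1)
			saw_serial++;
		pthread_mutex_unlock(&demo_lock);
	}
	return (NULL);
}

/* Returns 1 if exactly one thread ran the serial section and all
 * other threads observed it as finished after being released. */
int
cv_barrier_demo(void)
{
	pthread_t tids[DEMO_THREADS];
	int i;

	serial_runs = 0;
	saw_serial = 0;
	cv_barrier_init(&demo_bar, DEMO_THREADS);
	for (i = 0; i < DEMO_THREADS; i++)
		pthread_create(&tids[i], NULL, demo_worker, NULL);
	for (i = 0; i < DEMO_THREADS; i++)
		pthread_join(tids[i], NULL);
	return (serial_runs == 1 && saw_serial == DEMO_THREADS - 1);
}
```

The generation counter is one way to make the wait robust against spurious wakeups and barrier reuse; a simpler predicate might be enough if each barrier is only passed once.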
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=732b4aa05d78ca6831d02e67a43f34ad104f4f01

commit 732b4aa05d78ca6831d02e67a43f34ad104f4f01
Author:     Mark Peek <mp@FreeBSD.org>
AuthorDate: 2025-12-01 20:50:24 +0000
Commit:     Mark Peek <mp@FreeBSD.org>
CommitDate: 2025-12-01 20:50:24 +0000

    ctfmerge: fix segfault when building on macOS

    The barrier code was using semaphores which have been deprecated in
    macOS and not working at all, causing a race condition. Since macOS
    does not have pthread_barrier_*(), this change uses a condition
    variable instead.

    PR:             290958
    Reported by:    wosch
    MFC after:      2 weeks
    Reviewed by:    imp, markj
    Differential Revision:  https://reviews.freebsd.org/D54018

 cddl/contrib/opensolaris/tools/ctf/cvt/barrier.c | 26 +++---------------------
 cddl/contrib/opensolaris/tools/ctf/cvt/barrier.h |  9 ++------
 2 files changed, 5 insertions(+), 30 deletions(-)
No segfaults anymore after the fix. Thanks.