hwloc-info and other hwloc binaries crash on startup on 11.1-RELEASE. This problem also causes slurmd to crash. It occurs consistently on some machines, yet in some cases not on an otherwise identical machine.
This issue is probably the root cause of the following openmpi issue:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221946
1.11.1 seems to work fine. I have not tried anything between 1.11.1 and 1.11.7, but the openmpi PR above suggests 1.11.3 as the cutoff. We might want to consider rolling back to that version until the root cause is determined.
Compiling with -O0 helps on some systems, but not others. -O1 also helps on some systems.
The upstream developer suggested a clang bug. It seems unlikely to me that a code generator bug of this sort could go undetected this long, but given the unusual things hwloc does, it should be considered a possibility.
https://github.com/open-mpi/hwloc/issues/282#issuecomment-357539630
Compiling with gcc seems to help, but that may only mean that gcc arranges variables in a different way that prevents an hwloc bug from causing critical corruption.
Created attachment 189836
Unified diff
I would suggest the attached patch reverting to 1.11.2 to unbreak the port until the upstream code can be fixed, assuming no dependent ports require a later version at this time. It looks like it might take some time to figure this out. The upstream developers are stumped at the moment.
Created attachment 189839
Unified diff
Add comment to warn about issues with later versions.
Another possible (unattractive) stop-gap is to add USE_GCC. hwloc uses only C, so this should not cause any ABI headaches.
*** Bug 221946 has been marked as a duplicate of this bug. ***
Do you compile with special CFLAGS or CPUTYPE in /etc/make.conf (or elsewhere)? Can you attach the output of "objdump -d /usr/ports/devel/hwloc/work/hwloc-1.11.7/src/.libs/.libs/topology-x86.o"?
I already committed a change to this port to use GCC, which has resolved the issue for now.
Reopen, USE_GCC is only a stopgap measure.
Might I suggest:
Index: Makefile
===================================================================
--- Makefile (revision 460647)
+++ Makefile (working copy)
@@ -23,6 +23,7 @@
 --disable-gl
 INSTALL_TARGET= install-strip
 USES= iconv libtool pathfix pkgconfig tar:bzip2
+# Builds fine with clang, but crashes since 1.11.3
 USE_GCC= any
 USE_LDCONFIG= yes
 USE_GNOME= libxml2
Otherwise, people will be left wondering why USE_GCC is in place and be tempted to revert it.
A commit references this bug:
Author: feld
Date: Sun Feb 4 13:21:38 UTC 2018
New revision: 460928
URL: https://svnweb.freebsd.org/changeset/ports/460928
Log:
  devel/hwloc: Add comment to Makefile to clarify why GCC is in use
  PR: 225229
Changes:
  head/devel/hwloc/Makefile
Jason, can you provide the information I asked for in comment 5?
Created attachment 190749
objdump from clang build
objdump -d work/hwloc-1.11.7/src/.libs/topology-x86.o > objdump-clang1
Created attachment 190750
objdump from gcc build
objdump -d work/hwloc-1.11.7/src/.libs/topology-x86.o > objdump-gcc
Let me know if there's anything else I can provide. I'd dig into the code myself, but I'm swamped with other things right now.
I tweaked the port in my wip collection (See hwloc-debug at https://github.com/outpaddling/freebsd-ports-wip) to build with symbols and optimizations. Just using -DWITH_DEBUG turns off optimizations, which eliminates the crash.
With the modified port, I was able to get a good backtrace. Not sure this will lead us directly to the problem, as the developer has stated the problem moves around mysteriously with various changes. I suspect something prior to the crash is planting a land mine in the stack or elsewhere, so adding a debug printf() or changing optimization level moves or dodges the issue. The backtrace at least gives us a point from which we can work backwards...
root@compute-001:/usr/ports/wip/hwloc-debug # /usr/local/bin/gdb hwloc-info hwloc-info.core
GNU gdb (GDB) 8.0.1 [GDB v8.0.1 for FreeBSD]
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd11.1".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from hwloc-info...done.
[New LWP 100377]
Core was generated by `hwloc-info'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000008073c7000 in ?? ()
(gdb) where
#0 0x00000008073c7000 in ?? ()
#1 0x000000080084f61b in look_procs (backend=0x8020360c0, fulldiscovery=<optimized out>, highest_cpuid=<optimized out>, highest_ext_cpuid=2147483656, cpuid_type=intel, infos=<optimized out>, features=<optimized out>, get_cpubind=<optimized out>, set_cpubind=<optimized out>) at topology-x86.c:932
#2 hwloc_look_x86 (backend=0x8020360c0, fulldiscovery=<optimized out>) at topology-x86.c:1091
#3 0x000000080084f037 in hwloc_x86_discover (backend=0x8020360c0) at topology-x86.c:1162
#4 0x0000000800833d51 in hwloc_discover (topology=<optimized out>) at topology.c:2537
#5 hwloc_topology_load (topology=0x802031000) at topology.c:3038
#6 0x00000000004027ad in main (argc=0, argv=0x7fffffffeb50) at hwloc-info.c:490
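The "land mine" hypothesis above can be modelled in a few lines of C. This is only a sketch under stated assumptions: `overrun_write` is a hypothetical callee that writes 8 bytes where its caller expects 4, and the two structs are stand-ins for two possible stack layouts, since the real layout is up to the compiler. Nothing here is taken from hwloc or from the FreeBSD kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout A (hypothetical): the 4-byte slot is followed directly by a
 * value the program still relies on after the call. */
struct layout_a {
    int32_t  id;
    uint32_t live;
};

/* Layout B (hypothetical): the same slot is followed by bytes that
 * nobody reads afterwards, so a stray write there goes unnoticed. */
struct layout_b {
    int32_t  id;
    uint32_t dead;
    uint32_t live;
};

/* The model only holds if the members are packed as expected. */
static_assert(sizeof(struct layout_a) == 8, "unexpected padding in layout_a");
static_assert(offsetof(struct layout_b, live) == 8, "unexpected padding in layout_b");

/* Hypothetical buggy callee: the caller expects 4 bytes to be written,
 * but 8 bytes are written starting at dst. */
static void overrun_write(void *dst)
{
    unsigned char bytes[8];
    memset(bytes, 0xff, sizeof bytes);
    memcpy(dst, bytes, sizeof bytes);
}

/* Returns 1 if the live value survived the call under layout A. */
int live_survives_a(void)
{
    struct layout_a f = { 0, 0x12345678u };
    overrun_write(&f);   /* id sits at offset 0 of the struct */
    return f.live == 0x12345678u;
}

/* Returns 1 if the live value survived the call under layout B. */
int live_survives_b(void)
{
    struct layout_b f = { 0, 0, 0x12345678u };
    overrun_write(&f);
    return f.live == 0x12345678u;
}
```

Under these assumptions `live_survives_a()` returns 0 and `live_survives_b()` returns 1: the same stray write is either fatal or harmless depending only on what happens to sit next to the target. That would be consistent with a crash that moves or disappears when the compiler, the optimization level, or a debug printf() changes.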
Created attachment 190813
patch
The code produced by clang looks a bit strange around the cpuid instructions, but I don't immediately see where it might be wrong. In any case, the attached patch simplifies the cpuid asm and clang produces much cleaner code with it, so please give it a try.
Added the patch to https://github.com/outpaddling/freebsd-ports-wip/tree/master/hwloc-debug
root@login:/usr/ports/wip/hwloc-debug # hwloc-info
hwloc verbose debug enabled, may be disabled with HWLOC_DEBUG_VERBOSE=0 in the environment.
* CPU cpusets *
cpu 0 (os 0) has cpuset 0x00000001
cpu 1 (os 1) has cpuset 0x00000002
cpu 2 (os 2) has cpuset 0x00000004
cpu 3 (os 3) has cpuset 0x00000008
Machine#0(local=8352104KB total=0KB Backend=FreeBSD OSName=FreeBSD OSRelease=11.1-RELEASE-p4 OSVersion="FreeBSD 11.1-RELEASE-p4 #0: Tue Nov 14 06:12:40 UTC 2017 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC" HostName=login.wren.hpc.uwm.edu Architecture=amd64) cpuset 0xf...f complete 0x0000000f online 0xf...f allowed 0xf...f nodeset 0x0 completeN 0x0 allowedN 0xf...f
PU#0 cpuset 0x00000001
PU#1 cpuset 0x00000002
PU#2 cpuset 0x00000004
PU#3 cpuset 0x00000008
Backend x86 forcing a reconnect of levels
--- PU level has number 1
hwloc verbose debug enabled, may be disabled with HWLOC_DEBUG_VERBOSE=0 in the environment.
highest cpuid a, cpuid type 0
highest extended cpuid 80000008
binding to CPU0
Segmentation fault (core dumped)
root@login:/usr/ports/wip/hwloc-debug # /usr/local/bin/gdb hwloc-info hwloc-info.core
GNU gdb (GDB) 8.0.1 [GDB v8.0.1 for FreeBSD]
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd11.1".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from hwloc-info...done.
[New LWP 100704]
Core was generated by `hwloc-info'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000805238000 in ?? ()
(gdb) where
#0 0x0000000805238000 in ?? ()
#1 0x000000080084f5db in look_procs (backend=0x80203d0c0, fulldiscovery=<optimized out>, highest_cpuid=<optimized out>, highest_ext_cpuid=2147483656, cpuid_type=intel, infos=<optimized out>, features=<optimized out>, get_cpubind=<optimized out>, set_cpubind=<optimized out>) at topology-x86.c:932
#2 hwloc_look_x86 (backend=0x80203d0c0, fulldiscovery=<optimized out>) at topology-x86.c:1091
#3 0x000000080084f037 in hwloc_x86_discover (backend=0x80203d0c0) at topology-x86.c:1162
#4 0x0000000800833d51 in hwloc_discover (topology=<optimized out>) at topology.c:2537
#5 hwloc_topology_load (topology=0x802038000) at topology.c:3038
#6 0x00000000004027ad in main (argc=0, argv=0x7fffffffeac8) at hwloc-info.c:490
Created attachment 190896
patch2
Second attempt. There is a buffer overrun in the cpuset_getid syscall in FreeBSD 11.1 (already fixed in stable/11 and head). The attached patch should work around that. Please give it a try.
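One generic way to work around a callee that writes past the end of its output is to hand it a scratch buffer with slack and copy out only the part that was asked for. The sketch below illustrates that idea and is not the attached patch, whose contents are not shown in this thread. `overrunning_getid` is a hypothetical stand-in that writes 8 bytes where 4 are expected; the real size and shape of the cpuset_getid overrun are an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the buggy call: the caller expects one
 * int32_t to be stored at out, but two words are written. */
static void overrunning_getid(void *out, int32_t value)
{
    int32_t wide[2] = { value, -1 };   /* second word is the overrun */
    assert(out != NULL);
    memcpy(out, wide, sizeof wide);
}

/* Workaround sketch: let the callee scribble on a private buffer that is
 * large enough to absorb the extra bytes, then return only the first word.
 * The caller's own locals are never exposed to the stray write. */
int32_t getid_with_slack(int32_t value)
{
    int32_t scratch[2] = { 0, 0 };
    overrunning_getid(scratch, value);
    return scratch[0];
}
```

With this wrapper, `getid_with_slack(42)` returns 42 regardless of how the compiler lays out the caller's stack frame, because the overrun lands in `scratch[1]` rather than in a neighbouring variable.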
It's working now, with and without a symbol table, with -O0, -O1, -O2, or -O3:
===> Registering installation for hwloc-1.11.7
Installing hwloc-1.11.7...
root@login:/usr/ports/wip/hwloc-debug # hwloc-info
depth 0: 1 Machine (type #1)
depth 1: 2 Package (type #3)
depth 2: 2 L2Cache (type #4)
depth 3: 4 L1dCache (type #4)
depth 4: 4 L1iCache (type #4)
depth 5: 4 Core (type #5)
depth 6: 4 PU (type #6)
Special depth -3: 6 Bridge (type #9)
Special depth -4: 5 PCI Device (type #10)
Nice work...
A commit references this bug:
Author: tijl
Date: Thu Feb 22 18:02:02 UTC 2018
New revision: 462617
URL: https://svnweb.freebsd.org/changeset/ports/462617
Log:
  Add a patch to work around a buffer overrun in cpuset_getid in FreeBSD 11.1. This caused a crash when compiled with Clang due to a different stack layout compared to GCC.
  PR: 225229
  See also: https://github.com/open-mpi/hwloc/issues/282
Changes:
  head/devel/hwloc/Makefile
  head/devel/hwloc/files/
  head/devel/hwloc/files/patch-src_topology-x86.c
A commit references this bug:
Author: jbeich
Date: Sun Oct 13 12:19:53 UTC 2019
New revision: 514386
URL: https://svnweb.freebsd.org/changeset/ports/514386
Log:
  devel/hwloc: drop FreeBSD 11.1 workaround after r481023
  PR: 225229
Changes:
  head/devel/hwloc/files/