Bug 225229 - devel/hwloc broken on 11.1-RELEASE
Summary: devel/hwloc broken on 11.1-RELEASE
Status: Closed FIXED
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Mark Felder
URL:
Keywords:
Duplicates: 221946
Depends on:
Blocks:
 
Reported: 2018-01-16 18:09 UTC by Jason W. Bacon
Modified: 2019-10-13 12:20 UTC
CC List: 5 users

See Also:
linimon: maintainer-feedback? (phd_kimberlite)


Attachments
Unified diff (3.16 KB, patch)
2018-01-17 14:00 UTC, Jason W. Bacon
Unified diff (3.38 KB, patch)
2018-01-17 16:27 UTC, Jason W. Bacon
objdump from clang build (129.49 KB, text/plain)
2018-02-18 17:23 UTC, Jason W. Bacon
objdump from gcc build (122.27 KB, text/plain)
2018-02-18 17:24 UTC, Jason W. Bacon
patch (1.98 KB, patch)
2018-02-19 17:53 UTC, Tijl Coosemans
patch2 (1.74 KB, patch)
2018-02-22 14:35 UTC, Tijl Coosemans

Description Jason W. Bacon freebsd_committer freebsd_triage 2018-01-16 18:09:02 UTC
hwloc-info and other hwloc binaries crash on startup on 11.1-RELEASE.

This problem also causes slurmd to crash.

It occurs consistently on some machines, yet in some cases does not occur on an otherwise identical machine.

This issue is probably the root cause of the following openmpi issue:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221946

1.11.1 seems to work fine. I have not tried anything between 1.11.1 and 1.11.7, but the openmpi PR above suggests 1.11.3 as the cutoff. We might want to consider rolling back to a version before that cutoff until the root cause is determined.

Compiling with -O0 helps on some systems, but not others.  -O1 also helps on some systems.

Upstream developer suggested a clang bug.  It seems unlikely to me that a code generator bug of this sort could go undetected this long, but given the unusual things hwloc does, it should be considered a possibility.

https://github.com/open-mpi/hwloc/issues/282#issuecomment-357539630

Compiling with gcc seems to help, but that may only mean that gcc arranges variables in a different way that prevents an hwloc bug from causing critical corruption.
Comment 1 Jason W. Bacon freebsd_committer freebsd_triage 2018-01-17 14:00:40 UTC
Created attachment 189836 [details]
Unified diff

I would suggest the attached patch, which reverts to 1.11.2 to unbreak the port until the upstream code can be fixed, assuming no dependent ports require a later version at this time. It looks like it might take some time to figure this out. The upstream developers are stumped at the moment.
Comment 2 Jason W. Bacon freebsd_committer freebsd_triage 2018-01-17 16:27:33 UTC
Created attachment 189839 [details]
Unified diff

Add comment to warn about issues with later versions.
Comment 3 Jason W. Bacon freebsd_committer freebsd_triage 2018-01-17 16:33:57 UTC
Another possible (unattractive) stop-gap is to add USE_GCC.

hwloc uses only C, so this should not cause any ABI headaches.
Comment 4 Anonymized Account freebsd_committer freebsd_triage 2018-02-02 13:40:52 UTC
*** Bug 221946 has been marked as a duplicate of this bug. ***
Comment 5 Tijl Coosemans freebsd_committer freebsd_triage 2018-02-02 19:29:34 UTC
Do you compile with special CFLAGS or CPUTYPE in /etc/make.conf (or elsewhere)?
Can you attach the output of "objdump -d /usr/ports/devel/hwloc/work/hwloc-1.11.7/src/.libs/topology-x86.o"?
Comment 6 Mark Felder freebsd_committer freebsd_triage 2018-02-03 12:58:39 UTC
I already committed a change to this port to use GCC, which has resolved the issue for now.
Comment 7 Tijl Coosemans freebsd_committer freebsd_triage 2018-02-03 13:54:16 UTC
Reopen, USE_GCC is only a stopgap measure.
Comment 8 Jason W. Bacon freebsd_committer freebsd_triage 2018-02-03 14:03:31 UTC
Might I suggest:

Index: Makefile
===================================================================
--- Makefile	(revision 460647)
+++ Makefile	(working copy)
@@ -23,6 +23,7 @@
 		--disable-gl
 INSTALL_TARGET=	install-strip
 USES=		iconv libtool pathfix pkgconfig tar:bzip2
+# Builds fine with clang, but crashes since 1.11.3
 USE_GCC=	any
 USE_LDCONFIG=	yes
 USE_GNOME=	libxml2

Otherwise, people will be left wondering why USE_GCC is in place and be tempted to revert it.
Comment 9 commit-hook freebsd_committer freebsd_triage 2018-02-04 13:22:06 UTC
A commit references this bug:

Author: feld
Date: Sun Feb  4 13:21:38 UTC 2018
New revision: 460928
URL: https://svnweb.freebsd.org/changeset/ports/460928

Log:
  devel/hwloc: Add comment to Makefile to clarify why GCC is in use

  PR:		225229

Changes:
  head/devel/hwloc/Makefile
Comment 10 Tijl Coosemans freebsd_committer freebsd_triage 2018-02-17 14:44:37 UTC
Jason, can you provide the information I asked for in comment 5?
Comment 11 Jason W. Bacon freebsd_committer freebsd_triage 2018-02-18 17:23:06 UTC
Created attachment 190749 [details]
objdump from clang build

objdump -d work/hwloc-1.11.7/src/.libs/topology-x86.o > objdump-clang1
Comment 12 Jason W. Bacon freebsd_committer freebsd_triage 2018-02-18 17:24:10 UTC
Created attachment 190750 [details]
objdump from gcc build

objdump -d work/hwloc-1.11.7/src/.libs/topology-x86.o > objdump-gcc
Comment 13 Jason W. Bacon freebsd_committer freebsd_triage 2018-02-18 17:26:07 UTC
Let me know if there's anything else I can provide.  I'd dig into the code myself, but I'm swamped with other things right now.
Comment 14 Jason W. Bacon freebsd_committer freebsd_triage 2018-02-19 14:15:45 UTC
I tweaked the port in my wip collection (See hwloc-debug at https://github.com/outpaddling/freebsd-ports-wip) to build with symbols and optimizations.  Just using -DWITH_DEBUG turns off optimizations, which eliminates the crash.

With the modified port, I was able to get a good backtrace.

Not sure this will lead us directly to the problem, as the developer has stated the problem moves around mysteriously with various changes.  I suspect something prior to the crash is planting a land mine in the stack or elsewhere, so adding a debug printf() or changing optimization level moves or dodges the issue.

The backtrace at least gives us a point from which we can work backwards...

root@compute-001:/usr/ports/wip/hwloc-debug # /usr/local/bin/gdb hwloc-info hwloc-info.core
GNU gdb (GDB) 8.0.1 [GDB v8.0.1 for FreeBSD]
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd11.1".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from hwloc-info...done.
[New LWP 100377]
Core was generated by `hwloc-info'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000008073c7000 in ?? ()
(gdb) where
#0  0x00000008073c7000 in ?? ()
#1  0x000000080084f61b in look_procs (backend=0x8020360c0, 
    fulldiscovery=<optimized out>, highest_cpuid=<optimized out>, 
    highest_ext_cpuid=2147483656, cpuid_type=intel, infos=<optimized out>, 
    features=<optimized out>, get_cpubind=<optimized out>, 
    set_cpubind=<optimized out>) at topology-x86.c:932
#2  hwloc_look_x86 (backend=0x8020360c0, fulldiscovery=<optimized out>)
    at topology-x86.c:1091
#3  0x000000080084f037 in hwloc_x86_discover (backend=0x8020360c0)
    at topology-x86.c:1162
#4  0x0000000800833d51 in hwloc_discover (topology=<optimized out>)
    at topology.c:2537
#5  hwloc_topology_load (topology=0x802031000) at topology.c:3038
#6  0x00000000004027ad in main (argc=0, argv=0x7fffffffeb50)
    at hwloc-info.c:490
Comment 15 Tijl Coosemans freebsd_committer freebsd_triage 2018-02-19 17:53:31 UTC
Created attachment 190813 [details]
patch

The code produced by clang looks a bit strange around the cpuid instructions, but I don't immediately see where it might be wrong.  In any case, the attached patch simplifies the cpuid asm and clang produces much cleaner code with it, so please give it a try.
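
For readers unfamiliar with what a simplified cpuid wrapper looks like, here is a minimal sketch in GCC/Clang extended asm. It is not the attached patch; the names cpuid_query and cpuid_vendor are hypothetical and chosen only for illustration. It assumes an x86 target (on other architectures it reports zeros), and on i386 PIC builds with older GCC releases the "=b" constraint may be rejected because EBX is reserved there.

```c
#include <string.h>

/*
 * Illustrative only: a minimal CPUID wrapper, not the patch from this bug.
 * EAX selects the leaf and ECX the subleaf on input; all four registers
 * are outputs.
 */
void cpuid_query(unsigned *eax, unsigned *ebx, unsigned *ecx, unsigned *edx)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("cpuid"
                         : "+a"(*eax), "=b"(*ebx), "+c"(*ecx), "=d"(*edx));
#else
    /* No CPUID instruction on this architecture: report nothing. */
    *eax = *ebx = *ecx = *edx = 0;
#endif
}

/* Fill vendor[13] with the leaf-0 vendor string; return the highest basic leaf. */
unsigned cpuid_vendor(char vendor[13])
{
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;

    cpuid_query(&eax, &ebx, &ecx, &edx);
    /* The vendor bytes arrive in EBX, EDX, ECX order. */
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
    return eax;
}
```

Letting the compiler pick the registers through constraints, rather than saving and restoring them by hand inside the asm string, is the kind of simplification that tends to produce cleaner generated code.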
Comment 16 Jason W. Bacon freebsd_committer freebsd_triage 2018-02-20 16:34:34 UTC
Added the patch to:

https://github.com/outpaddling/freebsd-ports-wip/tree/master/hwloc-debug

root@login:/usr/ports/wip/hwloc-debug # hwloc-info
hwloc verbose debug enabled, may be disabled with HWLOC_DEBUG_VERBOSE=0 in the environment.


 * CPU cpusets *

cpu 0 (os 0) has cpuset 0x00000001
cpu 1 (os 1) has cpuset 0x00000002
cpu 2 (os 2) has cpuset 0x00000004
cpu 3 (os 3) has cpuset 0x00000008
Machine#0(local=8352104KB total=0KB Backend=FreeBSD OSName=FreeBSD OSRelease=11.1-RELEASE-p4 OSVersion="FreeBSD 11.1-RELEASE-p4 #0: Tue Nov 14 06:12:40 UTC 2017     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC" HostName=login.wren.hpc.uwm.edu Architecture=amd64) cpuset 0xf...f complete 0x0000000f online 0xf...f allowed 0xf...f nodeset 0x0 completeN 0x0 allowedN 0xf...f
  PU#0 cpuset 0x00000001
  PU#1 cpuset 0x00000002
  PU#2 cpuset 0x00000004
  PU#3 cpuset 0x00000008
Backend x86 forcing a reconnect of levels
--- PU level has number 1

hwloc verbose debug enabled, may be disabled with HWLOC_DEBUG_VERBOSE=0 in the environment.
highest cpuid a, cpuid type 0
highest extended cpuid 80000008
binding to CPU0
Segmentation fault (core dumped)

root@login:/usr/ports/wip/hwloc-debug # /usr/local/bin/gdb hwloc-info hwloc-info.core 
Reading symbols from hwloc-info...done.
[New LWP 100704]
Core was generated by `hwloc-info'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000805238000 in ?? ()
(gdb) where
#0  0x0000000805238000 in ?? ()
#1  0x000000080084f5db in look_procs (backend=0x80203d0c0, 
    fulldiscovery=<optimized out>, highest_cpuid=<optimized out>, 
    highest_ext_cpuid=2147483656, cpuid_type=intel, infos=<optimized out>, 
    features=<optimized out>, get_cpubind=<optimized out>, 
    set_cpubind=<optimized out>) at topology-x86.c:932
#2  hwloc_look_x86 (backend=0x80203d0c0, fulldiscovery=<optimized out>)
    at topology-x86.c:1091
#3  0x000000080084f037 in hwloc_x86_discover (backend=0x80203d0c0)
    at topology-x86.c:1162
#4  0x0000000800833d51 in hwloc_discover (topology=<optimized out>)
    at topology.c:2537
#5  hwloc_topology_load (topology=0x802038000) at topology.c:3038
#6  0x00000000004027ad in main (argc=0, argv=0x7fffffffeac8)
    at hwloc-info.c:490
Comment 17 Tijl Coosemans freebsd_committer freebsd_triage 2018-02-22 14:35:04 UTC
Created attachment 190896 [details]
patch2

Second attempt.  There is a buffer overrun in the cpuset_getid syscall in FreeBSD 11.1 (already fixed in stable/11 and head).  The attached patch should work around that.  Please give it a try.
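
The general idea behind this kind of workaround can be sketched as follows. This is a simulation, not the attached patch and not the real cpuset_getid(2): buggy_get_id, get_id_padded, and the 4-byte/8-byte sizes are assumptions made up for illustration. A callee that writes more bytes than the caller expects clobbers whatever the compiler placed next to the destination, which would explain why the crash depends on compiler and optimization level. Receiving the result in a buffer with slack keeps the extra bytes away from neighbouring variables.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* What the hypothetical buggy call writes: the ID plus 4 stray bytes. */
struct wide_result { int32_t id; int32_t junk; };

/* Stand-in for a call that should fill 4 bytes but copies out 8. */
void buggy_get_id(void *out)
{
    struct wide_result r = { 42, -1 };

    memcpy(out, &r, sizeof r);
}

/* Models a stack slot with a neighbouring variable right after it. */
struct frame { int32_t id; int32_t neighbour; };

/* Without slack, the neighbour is overwritten by the stray bytes. */
int32_t neighbour_after_unpadded_call(void)
{
    struct frame f = { 0, 7 };

    assert(sizeof f == sizeof(struct wide_result)); /* write stays inside f */
    buggy_get_id(&f);
    return f.neighbour;
}

/* Workaround: receive into a buffer with slack, then keep only the ID. */
int32_t get_id_padded(void)
{
    int32_t scratch[4] = { 0 }; /* 16 bytes, more than the call writes */

    buggy_get_id(scratch);
    return scratch[0];
}
```

In this simulation neighbour_after_unpadded_call() returns -1 instead of 7, while get_id_padded() returns 42 with nothing else disturbed.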
Comment 18 Jason W. Bacon freebsd_committer freebsd_triage 2018-02-22 17:40:32 UTC
It's working now, with and without a symbol table, with -O0, -O1, -O2, or -O3:

===>   Registering installation for hwloc-1.11.7
Installing hwloc-1.11.7...
root@login:/usr/ports/wip/hwloc-debug # hwloc-info
depth 0:	1 Machine (type #1)
 depth 1:	2 Package (type #3)
  depth 2:	2 L2Cache (type #4)
   depth 3:	4 L1dCache (type #4)
    depth 4:	4 L1iCache (type #4)
     depth 5:	4 Core (type #5)
      depth 6:	4 PU (type #6)
Special depth -3:	6 Bridge (type #9)
Special depth -4:	5 PCI Device (type #10)

Nice work...
Comment 19 commit-hook freebsd_committer freebsd_triage 2018-02-22 18:02:12 UTC
A commit references this bug:

Author: tijl
Date: Thu Feb 22 18:02:02 UTC 2018
New revision: 462617
URL: https://svnweb.freebsd.org/changeset/ports/462617

Log:
  Add a patch to work around a buffer overrun in cpuset_getid in FreeBSD 11.1.
  This caused a crash when compiled with Clang due to a different stack
  layout compared to GCC.

  PR:		225229
  See also:	https://github.com/open-mpi/hwloc/issues/282

Changes:
  head/devel/hwloc/Makefile
  head/devel/hwloc/files/
  head/devel/hwloc/files/patch-src_topology-x86.c
Comment 20 commit-hook freebsd_committer freebsd_triage 2019-10-13 12:20:50 UTC
A commit references this bug:

Author: jbeich
Date: Sun Oct 13 12:19:53 UTC 2019
New revision: 514386
URL: https://svnweb.freebsd.org/changeset/ports/514386

Log:
  devel/hwloc: drop FreeBSD 11.1 workaround after r481023

  PR:		225229

Changes:
  head/devel/hwloc/files/