I am compiling lang/ghc with CPUTYPE?=native set. The ghc-pkg executable produced during the build is broken. During the staging step, the following command is run:

/wrkdirs/usr/ports/lang/ghc/work/stage/usr/local/lib/ghc-8.0.2/bin/ghc-pkg --force --global-package-db "/wrkdirs/usr/ports/lang/ghc/work/stage/usr/local/lib/ghc-8.0.2/package.conf.d" update rts/dist/package.conf.install

It fails with SIGBUS. Debugging under LLDB gives:

(lldb) run
Process 29833 stopped
* thread #1, name = 'ghc-pkg', stop reason = signal SIGBUS: hardware error
    frame #0: 0x0000000804e476c6 libHSrts-ghc8.0.2.so`initGcThreads [inlined] new_gc_thread(n=0) at GC.c:818
   815          ws->todo_q = newWSDeque(128);
   816          ws->todo_overflow = NULL;
   817          ws->n_todo_overflow = 0;
-> 818          ws->todo_large_objects = NULL;
   819
   820          ws->part_list = NULL;
   821          ws->n_part_blocks = 0;
(lldb) disas -F intel
-> 0x804e476c6 <+326>: vmovaps ymmword ptr [rbx + 0x40], ymm0

(rbx + 0x40) mod 16 = 0, but (rbx + 0x40) mod 32 = 16. vmovaps with a ymm operand requires a 32-byte-aligned address, so I suspect this misalignment is the source of the error.

Steps to reproduce:

# echo CPUTYPE?=native >> /usr/local/etc/poudriere.d/<yourjail>-<yourports>-make.conf
# poudriere testport -j yourjail -p yourports -I lang/ghc

Once that fails, enter the jail and run:

# cd /wrkdirs/usr/ports/lang/ghc/work/ghc-8.0.2
# lldb -- /wrkdirs/usr/ports/lang/ghc/work/stage/usr/local/lib/ghc-8.0.2/bin/ghc-pkg --force --global-package-db "/wrkdirs/usr/ports/lang/ghc/work/stage/usr/local/lib/ghc-8.0.2/package.conf.d" update rts/dist/package.conf.install

The same bug is present on 11.1-RELEASE too.
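The alignment arithmetic in the report can be checked with a small C sketch. The function names are hypothetical, and the address used in the usage note is made up, since the report does not give the actual value of rbx; only the residues (0 mod 16, 16 mod 32) come from the report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of bytes by which addr lies past the previous multiple of
 * align. A result of 0 means addr is aligned. align must be a power
 * of two. (Hypothetical helper, not part of GHC or FreeBSD.) */
static inline uintptr_t misalignment(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & (align - 1);
}

/* vmovaps with a ymm operand needs a 32-byte-aligned memory operand;
 * the xmm form needs only 16 bytes. */
static inline bool ok_for_vmovaps_ymm(uintptr_t addr)
{
    return misalignment(addr, 32) == 0;
}
```

For any address with the reported residues, such as the made-up value 0x1010, misalignment(addr, 16) is 0 while misalignment(addr, 32) is 16, so ok_for_vmovaps_ymm returns false: a 16-byte xmm store would be accepted, a 32-byte ymm store would not.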
I assume this is occurring on i386? And what actual CPU type do you have? If you want to see what clang auto-detects, run:

clang -v -march=native -c -x c /dev/null -o /dev/null 2>&1 | grep target-cpu

and look for the -target-cpu option.
(In reply to Dimitry Andric from comment #1)
Nope, I'm on amd64:

"/usr/bin/clang" -cc1 -triple x86_64-unknown-freebsd12.0 <...> -target-cpu skylake <...>
Unfortunately, I do not have Skylake or other AVX2-capable hardware at the moment, and -march=native on my Ivy Bridge machine works just fine (i.e., no SIGBUS). Having this reduced to a somewhat smaller test case would be nice...
(In reply to arrowd from comment #2)
> Nope, I'm on amd64:
>
> "/usr/bin/clang" -cc1 -triple x86_64-unknown-freebsd12.0 <...> -target-cpu
> skylake <...>

Is it possible for you to figure out how GC.c is compiled on your system by the ghc build process, and then manually run the same command, adding "-v -save-temps"? Then please put the .c, .ii, .s and .o files in a tarball, together with a log of the full compiler output (e.g. the intermediate command lines that it shows via -v), and attach that.
Created attachment 191205
Tarball with requested files

Here is the tarball with the files you requested.
&the_gc_thread is cast to gc_thread *t:

new_gc_thread((gc_thread *)&the_gc_thread) => ws = &t->gens[...]

ws points to a gen_workspace, which is declared __aligned(64) (supposedly). This is why Clang is able to generate the aligned AVX store.

Neither gc_thread nor gc_thread::gens is tagged with any explicit alignment constraint. the_gc_thread is declared as

StgWord8 the_gc_thread[sizeof(gc_thread) + 64 * sizeof(gen_workspace)];

which is a plain byte array and therefore has no alignment requirement. That's the problem: nothing guarantees that the workspaces inside it land on 64-byte (or even 32-byte) boundaries. This is bogus code in GHC.
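The mismatch described above can be made visible with a minimal C11 sketch. All type names and field layouts here are hypothetical stand-ins, not the real RTS definitions, and alignas(64) on the first member is used in place of the __aligned(64) attribute mentioned above. In this mock, the outer struct picks up the 64-byte requirement from its member type, while the byte array used as backing storage promises only 1-byte alignment.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for gen_workspace: 64-byte aligned. */
typedef struct {
    alignas(64) void *todo_large_objects;
    void *part_list;
    uint64_t n_part_blocks;
} mock_workspace;

/* Hypothetical stand-in for gc_thread, ending in a flexible array. */
typedef struct {
    uint32_t thread_index;
    mock_workspace gens[];
} mock_gc_thread;

/* Same shape as the declaration quoted above: raw bytes. */
typedef uint8_t mock_storage[sizeof(mock_gc_thread) +
                             64 * sizeof(mock_workspace)];

static_assert(alignof(mock_workspace) == 64,
              "mock workspace should demand 64-byte alignment");

/* Alignment the struct view of the memory demands. */
static size_t required_alignment(void) { return alignof(mock_gc_thread); }

/* Alignment the byte array actually guarantees. */
static size_t provided_alignment(void) { return alignof(mock_storage); }

/* True only if casting the storage to mock_gc_thread * is safe. */
static bool storage_alignment_is_sufficient(void)
{
    return provided_alignment() >= required_alignment();
}
```

Here storage_alignment_is_sufficient() returns false, because the storage guarantees 1 byte and the struct view demands 64. Whether a given build crashes then depends on where the linker happens to place the array and on whether the compiler emits aligned vector stores.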
Fixed upstream: https://ghc.haskell.org/trac/ghc/ticket/15482

Thanks, cem, for your analysis.
(In reply to Gleb Popov from comment #7) Happy to help! It was a fun puzzle and you did all the hard work for me :-).
(In reply to Gleb Popov from comment #7)
Please note that the upstream GHC fix is incorrect, and continue to follow up with them about that.

Their change c6cc93bca only aligns the array to W_ aka StgWord aka StgWord64 aka unsigned long (8 bytes). This is insufficient for AVX2 alignment[1] (16 bytes for xmm, 32 for ymm) and still violates the guarantee attached to the gen_workspace structure (64-byte alignment).

To remove their UB, they need to either remove the 64-byte gen_workspace alignment or add 64-byte alignment to the array. (They could align both to the smaller 32 bytes and still allow the compiler to take advantage of AVX2.)

I don't know what led them to believe an 8-byte alignment would fix an unaligned 32-byte AVX access.

[1]: https://www.felixcloutier.com/x86/MOVAPS.html
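As a rough illustration of why 8-byte alignment does not help, and of what the second option in the comment above (aligning the array) could look like, here is a C11 sketch. The buffer name and size and the helper names are hypothetical and are not taken from the GHC sources or from the upstream commit.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>

#define WORKSPACE_ALIGN 64          /* alignment promised by gen_workspace */
enum { STORAGE_BYTES = 4096 };      /* hypothetical buffer size */

/* Byte storage that carries the full 64-byte guarantee. */
static alignas(WORKSPACE_ALIGN) uint8_t aligned_storage[STORAGE_BYTES];

/* True if p is a multiple of align (align must be a power of two). */
static bool aligned_to(const void *p, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Of the offsets within one `need`-byte window that satisfy a
 * `have`-byte guarantee, count those that still violate `need`. */
static unsigned misaligned_placements(unsigned have, unsigned need)
{
    assert(have != 0 && need != 0);
    unsigned bad = 0;
    for (unsigned off = 0; off < need; off += have)
        if (off % need != 0)
            bad++;
    return bad;
}
```

misaligned_placements(8, 32) is 3: of the four 8-byte-aligned positions inside a 32-byte window, three would still fault on an aligned ymm store, and misaligned_placements(8, 64) is 7. With the alignas(64) storage, aligned_to(aligned_storage, 64) holds regardless of where the linker places the buffer.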