226059 – Clang produces vmovaps with unaligned operand

Bug 226059 - Clang produces vmovaps with unaligned operand

Summary: Clang produces vmovaps with unaligned operand

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	bin (show other bugs)
Version:	CURRENT
Hardware:	Any Any

Importance:	--- Affects Only Me
Assignee:	Dimitry Andric

URL:
Keywords:

Depends on:
Blocks:

Reported:	2018-02-20 10:34 UTC by Gleb Popov
Modified:	2018-08-08 15:18 UTC (History)
CC List:	3 users (show)

See Also:	229788

Attachments
Tarball with requested files (785.50 KB, application/octet-stream) 2018-03-04 19:23 UTC, Gleb Popov	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Gleb Popov freebsd_committer

2018-02-20 10:34:23 UTC

I am compiling lang/ghc with CPUTYPE?=native set. The ghc-pkg executable produced during build gets broken. During staging step, the following command is run

/wrkdirs/usr/ports/lang/ghc/work/stage/usr/local/lib/ghc-8.0.2/bin/ghc-pkg --force --global-package-db "/wrkdirs/usr/ports/lang/ghc/work/stage/usr/local/lib/ghc-8.0.2/package.conf.d" update rts/dist/package.conf.install

but it fails with SIGBUS. Debugging under LLDB gives

(lldb) run
Process 29833 stopped
* thread #1, name = 'ghc-pkg', stop reason = signal SIGBUS: hardware error
    frame #0: 0x0000000804e476c6 libHSrts-ghc8.0.2.so`initGcThreads [inlined] new_gc_thread(n=0) at GC.c:818
   815          ws->todo_q = newWSDeque(128);
   816          ws->todo_overflow = NULL;
   817          ws->n_todo_overflow = 0;
-> 818          ws->todo_large_objects = NULL;
   819 
   820          ws->part_list = NULL;
   821          ws->n_part_blocks = 0;
(lldb) disas -F intel
->  0x804e476c6 <+326>: vmovaps ymmword ptr [rbx + 0x40], ymm0

The value of rbx + 0x40 mod 16 = 0, but rbx + 0x40 mod 32 = 16, which is a source of the error, I suspect.

Steps to reproduce:

# echo CPUTYPE?=native >> /usr/local/etc/poudriere.d/<yourjail>-<yourports>-make.conf
# poudriere testport -j yourjail -p yourports -I lang/ghc

Once that fails, enter the jail and

# cd /wrkdirs/usr/ports/lang/ghc/work/ghc-8.0.2
# lldb -- /wrkdirs/usr/ports/lang/ghc/work/stage/usr/local/lib/ghc-8.0.2/bin/ghc-pkg --force --global-package-db "/wrkdirs/usr/ports/lang/ghc/work/stage/usr/local/lib/ghc-8.0.2/package.conf.d" update rts/dist/package.conf.install

The same bug is present on 11.1-RELEASE too.

Comment 1 Dimitry Andric freebsd_committer

2018-02-21 07:35:25 UTC

I assume this is occurring on i386?  And what actual CPU type do you have?

If you want to see what clang auto-detects, run:

clang -v -march=native -c -x c /dev/null -o /dev/null 2>&1 | grep target-cpu

and look for the -target-cpu option.

Comment 2 Gleb Popov freebsd_committer

2018-02-21 08:42:00 UTC

(In reply to Dimitry Andric from comment #1)

Nope, I'm on amd64:

"/usr/bin/clang" -cc1 -triple x86_64-unknown-freebsd12.0 <...> -target-cpu skylake <...>

Comment 3 Dimitry Andric freebsd_committer

2018-02-21 14:15:25 UTC

Unfortunately I do not have skylake or AVX2 capable hardware at this moment, and -march=native on my IvyBridge machine works just fine (i.e, no SIGBUS).

Having this reduced to a somewhat smaller test case would be nice...

Comment 4 Dimitry Andric freebsd_committer

2018-02-23 10:15:31 UTC

(In reply to arrowd from comment #2)
> Nope, I'm on amd64:
> 
> "/usr/bin/clang" -cc1 -triple x86_64-unknown-freebsd12.0 <...> -target-cpu
> skylake <...>

Is it possible for you to figure out how GC.c is compiled on your system by the ghc build process, and then manually run the same command, adding "-v -save-temps" ?

Then please put the .c, .ii, .s and .o files in a tarball, together with a log of the full compiler output (e.g. the intermediate command lines that it shows via -v), and attach that.

Comment 5 Gleb Popov freebsd_committer

2018-03-04 19:23:05 UTC

Created attachment 191205 [details]
Tarball with requested files

Here is the tarball with files you requested.

Comment 6 Conrad Meyer freebsd_committer

2018-08-05 01:55:42 UTC

&the_gc_thread is cast to gc_thread *t.

new_gc_thread((gc_thread *)&the_gc_thread) =>
  ws = &t->gens[...]

ws is gen_workspace, which is __aligned(64) (supposedly).  (This is why Clang is able to generate the aligned AVX op.)

Neither gc_thread nor gc_thread::gens are tagged with any explicit alignment constraint.

the_gc_thread is declared as 'StgWord8 the_gc_thread[sizeof(gc_thread) + 64 * sizeof(gen_workspace)];', which has no alignment requirements.

That's the problem.  This is bogus code in GHC.

Comment 7 Gleb Popov freebsd_committer

2018-08-08 14:04:30 UTC

Fixed upstream: https://ghc.haskell.org/trac/ghc/ticket/15482

Thanks cem for your analysis.

Comment 8 Conrad Meyer freebsd_committer

2018-08-08 15:08:02 UTC

(In reply to Gleb Popov from comment #7)
Happy to help!  It was a fun puzzle and you did all the hard work for me :-).

Comment 9 Conrad Meyer freebsd_committer

2018-08-08 15:18:41 UTC

(In reply to Gleb Popov from comment #7)
Please note that the upstream GHC fix is incorrect and continue to follow-up with them about that.

Their change c6cc93bca only aligns the array to W_ aka StgWord aka StgWord64 aka unsigned long (8 bytes).

This is insufficient for AVX2 alignment[1] (16 bytes for xmm, 32 for ymm) and still violates the guarantee attached to the gen_workspace structure (64 byte alignment).  They need to remove the 64-byte gen_workspace alignment or add 64-byte alignment to the array to remove their UB.  (They could align both to the smaller 32 bytes and still allow the compiler to take advantage of AVX2.)

I don't know what lead them to believe an 8-byte alignment would fix an unaligned 32-byte AVX access.

[1]: https://www.felixcloutier.com/x86/MOVAPS.html