218387 – GNU objdump in base doesn't understand VEX instructions

Status:	Closed Overcome By Events

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	bin
Version:	11.0-RELEASE
Hardware:	amd64 Any

Importance:	--- Affects Only Me
Assignee:	Ed Maste

Reported:	2017-04-05 01:35 UTC by Adam Stylinski
Modified:	2021-09-16 14:23 UTC
CC List:	5 users

See Also:	229046

Attachments
Minimal test case (183 bytes, text/plain) 2017-04-05 07:19 UTC, Adam Stylinski	no flags	Details

Description Adam Stylinski 2017-04-05 01:35:39 UTC

I think this is a bug, but perhaps I just think it is because the behavior diverges from GCC by quite a bit.

When you compile an application that does floating point math with clang in current right now, specifying -march=native or -march=sandybridge or btver2 or any number of CPUs I tried to target produces predominantly x87 instructions.  Typically GCC and most other compilers I've seen default to targeting SSE instructions, instead.

If you specify -march=x86-64, you get the results you'd expect (predominantly SSE2 instructions for floating point).  This leads me to believe this is a bug, since for nearly all floating-point code the SSE units should be the best choice.

Comment 1 Adam Stylinski 2017-04-05 02:41:04 UTC

Ahh, evidently this isn't a new issue.  Any value of -march that is later than nocona results in x87 instructions being emitted all over the binary.

I'd still consider it a bug, though, as Apple & Linux's clang ports appear to have correct behavior for these values.

Comment 2 Adam Stylinski 2017-04-05 03:44:39 UTC

Changed to 11.0-RELEASE because it happens as early as that (it might have been this way ever since clang was added to base).  It is still a bug in CURRENT, though.

Comment 3 Rodney W. Grimes freebsd_committer 2017-04-05 06:24:50 UTC

Attempting to confirm.

Using FreeBSD 11.0-STABLE #0 r316340 system.
Built src/lib/msun with /etc/make.conf containing CPUTYPE=sandybridge

clang -v
FreeBSD clang version 3.8.0 (tags/RELEASE_380/final 262564) (based on LLVM 3.8.0)

objdump -dSRC libm.so.5 indicates lots of legacy fpu instructions like
fmuls instead of mulss as well as usage of the %st() registers

Comment 4 Dimitry Andric freebsd_committer 2017-04-05 07:01:40 UTC

(In reply to Adam Stylinski from comment #0)
> I think this is a bug, but perhaps I just think it is because the behavior
> diverges from GCC by quite a bit.
> 
> When you compile an application that does floating point math with clang in
> current right now, specifying -march=native or -march=sandybridge or btver2
> or any number of CPUs I tried to target produces predominantly x87
> instructions.  Typically GCC and most other compilers I've seen default to
> targeting SSE instructions, instead.

Do you have a concrete test case?  I.e., a minimized C or C++ program, with the complete command line used to compile it, and pointers to the assembly where you think it is invalid?  (Note that x87 instructions are still perfectly valid on even the newest x86 CPUs.)


> If you specify -march=x86-64, you get the results you'd expect
> (predominantly SSE2 instructions for floating point).  This leads me to
> believe this is a bug, since for nearly all floating-point code the SSE
> units should be the best choice.

There are a few parts in the FreeBSD source tree which go out of their way to avoid SSE, so you might have hit those.  Again, without a concrete example it is not possible to say if there is any problem.


(In reply to Adam Stylinski from comment #1)
> Ahh, evidently this isn't a new issue.  Any value of -march that is later
> than nocona results in x87 instructions being emitted all over the binary.
> 
> I'd still consider it a bug, though, as Apple & Linux's clang ports appear
> to have correct behavior for these values.

I am highly skeptical about that, since there is no special "FreeBSD float" handling in LLVM or Clang.  But if you provide a good test case, I can likely take it upstream.

Comment 5 Adam Stylinski 2017-04-05 07:19:48 UTC

Created attachment 181500
Minimal test case

I'm attaching a minimal test case.  All that is needed to reproduce this is to have an amd64 system and compile with -march=sandybridge (I've also confirmed march=btver2 does this as well, and probably several others).

The code that Rodney is referring to (libm), when compiled with this flag, emits zero mulss instructions, which can't possibly be optimal for an x86 scalar math library.  It also obliterates some of the vectorized versions of scalar functions, such as some of the hyperbolic atan functions.

The code that I should expect to have x87 instructions are the few where it makes sense, such as remquo, fmod, and other places that might benefit from fprem.  Also, I'd expect it to be seen in the extended precision versions of the trig functions, as they benefit from the extra precision of x87.  But it became pervasive.

For this simple test example, this is the disassembly without -march at all:

00000000004007b0 <main>:
  4007b0:       55                      push   %rbp
  4007b1:       48 89 e5                mov    %rsp,%rbp
  4007b4:       48 8b 7e 08             mov    0x8(%rsi),%rdi
  4007b8:       e8 2f fd ff ff          callq  4004ec <atof@plt>
  4007bd:       f2 0f 5a c0             cvtsd2ss %xmm0,%xmm0
  4007c1:       f3 0f 59 05 5f 00 00    mulss  0x5f(%rip),%xmm0        # 400828 <_fini+0x10>
  4007c8:       00 
  4007c9:       f3 0f 5a c0             cvtss2sd %xmm0,%xmm0
  4007cd:       bf 2c 08 40 00          mov    $0x40082c,%edi
  4007d2:       b0 01                   mov    $0x1,%al
  4007d4:       e8 03 fd ff ff          callq  4004dc <printf@plt>
  4007d9:       31 c0                   xor    %eax,%eax
  4007db:       5d                      pop    %rbp
  4007dc:       c3                      retq   
  4007dd:       90                      nop    
  4007de:       90                      nop    
  4007df:       90                      nop    

This is with -march=sandybridge:

00000000004007b0 <main>:
  4007b0:       55                      push   %rbp
  4007b1:       48 89 e5                mov    %rsp,%rbp
  4007b4:       48 8b 7e 08             mov    0x8(%rsi),%rdi
  4007b8:       e8 2f fd ff ff          callq  4004ec <atof@plt>
  4007bd:       c5                      lds    (bad),%edi
  4007be:       fb                      sti    
  4007bf:       5a                      pop    %rdx
  4007c0:       c0 c5 fa                rol    $0xfa,%ch
  4007c3:       59                      pop    %rcx
  4007c4:       05 5f 00 00 00          add    $0x5f,%eax
  4007c9:       c5                      lds    (bad),%edi
  4007ca:       fa                      cli    
  4007cb:       5a                      pop    %rdx
  4007cc:       c0 bf 2c 08 40 00 b0    sarb   $0xb0,0x40082c(%rdi)
  4007d3:       01 e8                   add    %ebp,%eax
  4007d5:       03 fd                   add    %ebp,%edi
  4007d7:       ff                      (bad)  
  4007d8:       ff 31                   pushq  (%rcx)
  4007da:       c0 5d c3 90             rcrb   $0x90,-0x3d(%rbp)
  4007de:       90                      nop    
  4007df:       90                      nop

Comment 6 Adam Stylinski 2017-04-05 07:24:08 UTC

Also, on Linux, this is what I get with clang using -march=sandybridge:

0000000000400530 <main>:
  400530:       50                      push   %rax
  400531:       48 8b 7e 08             mov    0x8(%rsi),%rdi
  400535:       31 f6                   xor    %esi,%esi
  400537:       e8 d4 fe ff ff          callq  400410 <strtod@plt>
  40053c:       c5 fb 5a c0             vcvtsd2ss %xmm0,%xmm0,%xmm0
  400540:       c5 fa 59 05 9c 00 00    vmulss 0x9c(%rip),%xmm0,%xmm0        # 4005e4 <_IO_stdin_used+0x4>
  400547:       00 
  400548:       c5 fa 5a c0             vcvtss2sd %xmm0,%xmm0,%xmm0
  40054c:       bf e8 05 40 00          mov    $0x4005e8,%edi
  400551:       b0 01                   mov    $0x1,%al
  400553:       e8 c8 fe ff ff          callq  400420 <printf@plt>
  400558:       31 c0                   xor    %eax,%eax
  40055a:       59                      pop    %rcx
  40055b:       c3                      retq   
  40055c:       0f 1f 40 00             nopl   0x0(%rax)

Comment 7 Dimitry Andric freebsd_committer 2017-04-05 08:54:22 UTC

(In reply to Adam Stylinski from comment #5)
> I'm attaching a minimal test case.  All that is needed to reproduce this is
> to have an amd64 system and compile with -march=sandybridge (I've also
> confirmed march=btver2 does this as well, and probably several others).

Hmm, I can't reproduce your issue with any version of clang that I have here (I tried 3.4.1 through 4.0.0, and trunk).  You may be using a different command line, or have a different environment than me.  For example, with clang 4.0.0 on my systems:

$ clang -march=sandybridge -O2 -S test_prog.c -o -
        .text
        .file   "test_prog.c"
        .section        .rodata.cst4,"aM",@progbits,4
        .p2align        2
.LCPI0_0:
        .long   1077936128              # float 3
        .text
        .globl  main
        .p2align        4, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# BB#0:                                 # %entry
        pushq   %rbp
.Lcfi0:
        .cfi_def_cfa_offset 16
.Lcfi1:
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
.Lcfi2:
        .cfi_def_cfa_register %rbp
        movq    8(%rsi), %rdi
        callq   atof
        vcvtsd2ss       %xmm0, %xmm0, %xmm0
        vmulss  .LCPI0_0(%rip), %xmm0, %xmm0
        vcvtss2sd       %xmm0, %xmm0, %xmm0
        movl    $.L.str, %edi
        movb    $1, %al
        callq   printf
        xorl    %eax, %eax
        popq    %rbp
        retq
.Lfunc_end0:
        .size   main, .Lfunc_end0-main
        .cfi_endproc

        .type   .L.str,@object          # @.str
        .section        .rodata.str1.1,"aMS",@progbits,1
.L.str:
        .asciz  "a * b = %f\n"
        .size   .L.str, 12


        .ident  "FreeBSD clang version 4.0.0 (tags/RELEASE_400/final 297347) (based on LLVM 4.0.0)"
        .section        ".note.GNU-stack","",@progbits

As you can see it outputs approximately the same code as you showed for the plain x86_64 case, and for Linux.


> The code that Rodney is referring to (libm) when compiled with this flag
> emits zero mulss instructions, which can't possibly be optimal for an x86
> scalar math library.  It also obliterates some of the vectorized versions of
> scalar functions, such as some of the hyperbolic atan functions.

As far as I know, our libm is very old, and geared mostly towards old school x87, as the maintainers (if we still have any :) seemed to have a great aversion against anything smelling of SSE.  Maybe there are more modern maths libraries out there, which are optimized for recent CPUs, but this area is not my specialty.

Comment 8 Adam Stylinski 2017-04-05 13:25:23 UTC

(In reply to Dimitry Andric from comment #7)

Hmm, using GNU's objdump on Linux on this binary seems to show me correct instructions.  I think FreeBSD's objdump may be struggling with VEX-encoded instructions?

Seems like maybe libbfd is just too old.

Comment 9 Adam Stylinski 2017-04-05 15:57:45 UTC

Turns out the disassembler doesn't understand the VEX prefix for the SSE instructions.  I'm not sure if FreeBSD uses GNU's objdump (I'd think not), but the latest version of objdump in GNU leverages a libbfd that understands these instructions.

Considering we've been able to generate VEX instructions for a while now, it'd be nice to have a disassembler that recognized them.
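
As a rough illustration of what the old disassembler is missing, here is a minimal C sketch of decoding the two-byte VEX prefix (0xC5 plus one payload byte) that appears in the -march=sandybridge output above.  The struct and function names (vex2, decode_vex2) are made up for this example and are not taken from binutils or LLVM.  The sketch assumes 64-bit code, where a leading 0xC5 byte always starts a VEX prefix, and it ignores the three-byte 0xC4 form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields carried by a two-byte VEX prefix (hypothetical struct for this sketch). */
struct vex2 {
    unsigned r;      /* REX.R equivalent; stored inverted in the encoding */
    unsigned vvvv;   /* extra source register number; stored inverted */
    unsigned l;      /* vector length: 0 = scalar/128-bit, 1 = 256-bit */
    uint8_t legacy;  /* implied legacy prefix: 0x00 (none), 0x66, 0xF3 or 0xF2 */
};

/*
 * Returns 1 and fills *out if buf starts with a two-byte VEX prefix,
 * otherwise returns 0.  The opcode map after a two-byte VEX prefix is
 * implicitly 0F, so the byte following the prefix is the opcode itself.
 */
int decode_vex2(const uint8_t *buf, size_t len, struct vex2 *out)
{
    /* The low two payload bits ("pp") select the implied legacy prefix. */
    static const uint8_t pp_to_prefix[4] = { 0x00, 0x66, 0xF3, 0xF2 };
    uint8_t payload;

    assert(buf != NULL && out != NULL);
    if (len < 2 || buf[0] != 0xC5)
        return 0;

    payload = buf[1];
    out->r = ((payload >> 7) & 1u) ^ 1u;        /* undo the inversion */
    out->vvvv = ((payload >> 3) & 0xFu) ^ 0xFu; /* undo the inversion */
    out->l = (payload >> 2) & 1u;
    out->legacy = pp_to_prefix[payload & 3u];
    return 1;
}
```

Fed the bytes c5 fb 5a c0 from the listing, this yields an implied F2 prefix with the 0F opcode map, i.e. the same operation as the legacy f2 0f 5a c0 (cvtsd2ss) in the build without -march.  A decoder that predates VEX instead sees 0xC5 as the old LDS opcode, which would explain the "lds (bad)" lines.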

Comment 10 Dimitry Andric freebsd_committer 2017-04-05 20:55:42 UTC

Indeed, objdump in base is from GNU binutils 2.17.50, and is very old.  It will not be updated, obviously.  The version from the binutils port can display these instructions just fine.

Ed, it's not likely that elftoolchain will give us a complete objdump, right?  Since we already build llvm-objdump, we might consider using that instead, although the formatting is slightly off:

Disassembly of section .text:
main:
       0:	55 	pushq	%rbp
       1:	48 89 e5 	movq	%rsp, %rbp
       4:	48 8b 7e 08 	movq	8(%rsi), %rdi
       8:	e8 00 00 00 00 	callq	0 <main+0xD>
       d:	c5 fb 5a c0 	vcvtsd2ss	%xmm0, %xmm0, %xmm0
      11:	c5 fa 59 05 00 00 00 00 	vmulss	(%rip), %xmm0, %xmm0
      19:	c5 fa 5a c0 	vcvtss2sd	%xmm0, %xmm0, %xmm0
      1d:	bf 00 00 00 00 	movl	$0, %edi
      22:	b0 01 	movb	$1, %al
      24:	e8 00 00 00 00 	callq	0 <main+0x29>
      29:	31 c0 	xorl	%eax, %eax
      2b:	5d 	popq	%rbp
      2c:	c3 	retq

Comment 11 Ed Maste freebsd_committer 2017-04-05 21:02:58 UTC

(In reply to Dimitry Andric from comment #10)
> Ed, it's not likely that elftoolchain will give us a complete objdump, right?
> Since we already build llvm-objdump, we might consider using that instead,
> although the formatting is slightly off:

Yes, I've been installing LLVM's llvm-objdump as /usr/bin/objdump in my staging tree (https://github.com/emaste/freebsd/commit/5d04768ffe7cf1ed695f8d1caa803f446c5ed110)

It's missing a few command-line options compared to GNU objdump, and the formatting is a bit different, but it's broadly compatible.

We currently use only three binutils tools:

1. as
2. ld
3. objdump

LLVM's LLD as a replacement for ld is coming along well. We don't yet have a viable as replacement, although at least for amd64 we can do without it (and just use the compiler driver). So maybe we should introduce a conditional version (src.conf knob) of my patch above, with a plan of migrating to LLVM objdump by default in the future (and suggest that binutils from ports/packages be used in the cases where it is not sufficient).

Comment 12 Dimitry Andric freebsd_committer 2017-04-05 21:20:26 UTC

(In reply to Rodney W. Grimes from comment #3)
> objdump -dSRC libm.so.5 indicates lots of legacy fpu instructions like
> fmuls instead of mulss as well as usage of the %st() registers

Note that this is mostly because of the many long double functions in our libm, which cannot be natively implemented with SSE.  For example, lrintf() and lrint() are compiled to:

0000000000005090 <lrintf>:
    5090:       f3 48 0f 2d c0          cvtss2si %xmm0,%rax
    5095:       c3                      retq

00000000000050a0 <lrint>:
    50a0:       f2 48 0f 2d c0          cvtsd2si %xmm0,%rax
    50a5:       c3                      retq

but lrintl() becomes:

00000000000050b0 <logbl>:
    50b0:       db 6c 24 08             fldt   0x8(%rsp)
    50b4:       d9 f4                   fxtract
    50b6:       dd d8                   fstp   %st(0)
    50b8:       c3                      retq

Comment 13 Dimitry Andric freebsd_committer 2017-04-05 21:21:17 UTC

Eh, copy/paste error in that last comment, lrintl() is actually this:

0000000000005080 <lrintl>:
    5080:       db 6c 24 08             fldt   0x8(%rsp)
    5084:       48 83 ec 08             sub    $0x8,%rsp
    5088:       df 3c 24                fistpll (%rsp)
    508b:       58                      pop    %rax
    508c:       c3                      retq
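
The split seen above (SSE for lrintf() and lrint(), x87 for lrintl()) comes down to type widths.  Below is a small C sketch, with hypothetical function names, that makes the width difference visible at build time.  It assumes a C11 compiler and a target where float and double are IEEE single and double.

```c
#include <assert.h>
#include <float.h>

/*
 * SSE scalar arithmetic covers IEEE single (24-bit significand) and
 * IEEE double (53-bit significand) only.  These build-time checks
 * state that assumption explicitly.
 */
static_assert(FLT_MANT_DIG == 24, "float is expected to be IEEE single");
static_assert(DBL_MANT_DIG == 53, "double is expected to be IEEE double");

/* Significand width of long double on the build target, in bits. */
int long_double_mant_bits(void)
{
    return LDBL_MANT_DIG;
}

/*
 * Returns 1 when long double is wider than double, in which case it
 * cannot be handled by SSE scalar instructions and the compiler has to
 * fall back to another path (x87 on amd64).
 */
int long_double_wider_than_double(void)
{
    return LDBL_MANT_DIG > DBL_MANT_DIG;
}
```

On amd64 FreeBSD, long double is the 80-bit x87 extended format, so long_double_mant_bits() is expected to return 64 there, which is why functions such as lrintl() still show fldt and friends regardless of -march.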

Comment 14 Adam Stylinski 2017-04-05 21:23:06 UTC

Yeah I'd expect that to be the case for native extended precision.

However it is most certainly an objdump issue rather than a clang issue.

Comment 16 Gleb Popov freebsd_committer 2018-03-21 05:48:05 UTC

I got bitten by this when investigating https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226059

So, should we replace the binutils objdump with the LLVM one?

Comment 17 Dimitry Andric freebsd_committer 2018-03-21 07:45:44 UTC

(In reply to arrowd from comment #16)
> I got bitten by this when investigating
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226059
> 
> So, should we replace the binutils objdump with the LLVM one?

Since r310840 we already have llvm-objdump in /usr/bin by default, but it isn't installed as the default objdump yet.  There are likely still a few flag and output style differences, but maybe it's time to make the switch for 12.0.  Ed, what do you think?

Comment 18 Ed Maste freebsd_committer 2018-11-23 14:20:35 UTC

(In reply to Dimitry Andric from comment #17)
I have a proposal to start installing llvm-objdump as /usr/bin/objdump in https://reviews.freebsd.org/D18307 now.

Comment 19 Ed Maste freebsd_committer 2020-05-07 21:24:52 UTC

GNU objdump 2.17.50 has now been removed for FreeBSD 13.

Comment 20 Gleb Popov freebsd_committer 2021-09-16 14:23:53 UTC

Closing this, as this PR is not applicable anymore.