I think this is a bug, but perhaps I only think so because the behavior diverges from GCC by quite a bit.

When you compile an application that does floating point math with clang on CURRENT, specifying -march=native, -march=sandybridge, -march=btver2, or any of several other CPUs I tried to target produces predominantly x87 instructions. GCC and most other compilers I've seen default to targeting SSE instructions instead.

If you specify -march=x86-64, you get the results you'd expect (predominantly SSE2 instructions for floating point). This leads me to believe this is a bug: for almost everything, the SSE units should be the better choice.
Ahh, evidently this isn't a new issue. Any value of -march that is later than nocona results in x87 instructions being emitted all over the binary. I'd still consider it a bug, though, as Apple & Linux's clang ports appear to have correct behavior for these values.
Changed to 11.0-RELEASE because it happens as early as that (it might have been present ever since clang was added to base). It is still a bug in CURRENT, though.
Attempting to confirm.

Using FreeBSD 11.0-STABLE #0 r316340 system.
Built src/lib/msun with /etc/make.conf containing CPUTYPE=sandybridge

clang -v
FreeBSD clang version 3.8.0 (tags/RELEASE_380/final 262564) (based on LLVM 3.8.0)

objdump -dSRC libm.so.5 indicates lots of legacy fpu instructions like fmuls instead of mulss, as well as usage of the %st() registers
(In reply to Adam Stylinski from comment #0)
> I think this is a bug, but perhaps I just think it is because the behavior
> diverges from GCC by quite a bit.
>
> When you compile an application that does floating point math with clang in
> current right now, specifying -march=native or -march=sandybridge or btver2
> or any number of CPUs I tried to target produces predominantly x87
> instructions. Typically GCC and most other compilers I've seen default to
> targeting SSE instructions, instead.

Do you have any concrete test case? I.e. a minimized C or C++ program, with the complete command line used to compile it, and pointers to the assembly where you think it is invalid? (Note that x87 instructions are still perfectly valid on even the newest x86 CPUs.)

> If you specify -march=x86-64, you get the results you'd expect
> (predominantly SSE2 instructions for floating point). This leads me to
> believe this is a bug, for most everything SSE units should be the most
> optimal.

There are a few parts in the FreeBSD source tree which go out of their way to avoid SSE, so you might have hit those. Again, without a concrete example it is not possible to say if there is any problem.

(In reply to Adam Stylinski from comment #1)
> Ahh, evidently this isn't a new issue. Any value of -march that is later
> than nocona results in x87 instructions being emitted all over the binary.
>
> I'd still consider it a bug, though, as Apple & Linux's clang ports appear
> to have correct behavior for these values.

I am highly skeptical about that, since there is no special "FreeBSD float" handling in LLVM or Clang. But if you provide a good test case, I can likely take it upstream.
Created attachment 181500 [details]
Minimal test case

I'm attaching a minimal test case. All that is needed to reproduce this is to have an amd64 system and compile with -march=sandybridge (I've also confirmed -march=btver2 does this as well, and probably several others).

The code that Rodney is referring to (libm), when compiled with this flag, emits zero mulss instructions, which can't possibly be optimal for an x86 scalar math library. It also obliterates some of the vectorized versions of scalar functions, such as some of the hyperbolic atan functions.

The only code where I would expect x87 instructions is the few places where they make sense, such as remquo, fmod, and other functions that might benefit from fprem. I'd also expect them in the extended precision versions of the trig functions, as they benefit from the extra precision of x87. But they became pervasive.

For this simple test example, this is without -march at all:

00000000004007b0 <main>:
  4007b0:       55                      push   %rbp
  4007b1:       48 89 e5                mov    %rsp,%rbp
  4007b4:       48 8b 7e 08             mov    0x8(%rsi),%rdi
  4007b8:       e8 2f fd ff ff          callq  4004ec <atof@plt>
  4007bd:       f2 0f 5a c0             cvtsd2ss %xmm0,%xmm0
  4007c1:       f3 0f 59 05 5f 00 00    mulss  0x5f(%rip),%xmm0        # 400828 <_fini+0x10>
  4007c8:       00
  4007c9:       f3 0f 5a c0             cvtss2sd %xmm0,%xmm0
  4007cd:       bf 2c 08 40 00          mov    $0x40082c,%edi
  4007d2:       b0 01                   mov    $0x1,%al
  4007d4:       e8 03 fd ff ff          callq  4004dc <printf@plt>
  4007d9:       31 c0                   xor    %eax,%eax
  4007db:       5d                      pop    %rbp
  4007dc:       c3                      retq
  4007dd:       90                      nop
  4007de:       90                      nop
  4007df:       90                      nop

This is with -march=sandybridge:

00000000004007b0 <main>:
  4007b0:       55                      push   %rbp
  4007b1:       48 89 e5                mov    %rsp,%rbp
  4007b4:       48 8b 7e 08             mov    0x8(%rsi),%rdi
  4007b8:       e8 2f fd ff ff          callq  4004ec <atof@plt>
  4007bd:       c5                      lds    (bad),%edi
  4007be:       fb                      sti
  4007bf:       5a                      pop    %rdx
  4007c0:       c0 c5 fa                rol    $0xfa,%ch
  4007c3:       59                      pop    %rcx
  4007c4:       05 5f 00 00 00          add    $0x5f,%eax
  4007c9:       c5                      lds    (bad),%edi
  4007ca:       fa                      cli
  4007cb:       5a                      pop    %rdx
  4007cc:       c0 bf 2c 08 40 00 b0    sarb   $0xb0,0x40082c(%rdi)
  4007d3:       01 e8                   add    %ebp,%eax
  4007d5:       03 fd                   add    %ebp,%edi
  4007d7:       ff                      (bad)
  4007d8:       ff 31                   pushq  (%rcx)
  4007da:       c0 5d c3 90             rcrb   $0x90,-0x3d(%rbp)
  4007de:       90                      nop
  4007df:       90                      nop
Also, on Linux, this is what I get with clang using -march=sandybridge:

0000000000400530 <main>:
  400530:       50                      push   %rax
  400531:       48 8b 7e 08             mov    0x8(%rsi),%rdi
  400535:       31 f6                   xor    %esi,%esi
  400537:       e8 d4 fe ff ff          callq  400410 <strtod@plt>
  40053c:       c5 fb 5a c0             vcvtsd2ss %xmm0,%xmm0,%xmm0
  400540:       c5 fa 59 05 9c 00 00    vmulss 0x9c(%rip),%xmm0,%xmm0        # 4005e4 <_IO_stdin_used+0x4>
  400547:       00
  400548:       c5 fa 5a c0             vcvtss2sd %xmm0,%xmm0,%xmm0
  40054c:       bf e8 05 40 00          mov    $0x4005e8,%edi
  400551:       b0 01                   mov    $0x1,%al
  400553:       e8 c8 fe ff ff          callq  400420 <printf@plt>
  400558:       31 c0                   xor    %eax,%eax
  40055a:       59                      pop    %rcx
  40055b:       c3                      retq
  40055c:       0f 1f 40 00             nopl   0x0(%rax)
(In reply to Adam Stylinski from comment #5)
> I'm attaching a minimal test case. All that is needed to reproduce this is
> to have an amd64 system and compile with -march=sandybridge (I've also
> confirmed march=btver2 does this as well, and probably several others).

Hmm, I can't reproduce your issue with any version of clang that I have here (I tried 3.4.1 through 4.0.0, and trunk). You may be using a different command line, or have a different environment than me.

For example, with clang 4.0.0 on my systems:

$ clang -march=sandybridge -O2 -S test_prog.c -o -
        .text
        .file   "test_prog.c"
        .section        .rodata.cst4,"aM",@progbits,4
        .p2align        2
.LCPI0_0:
        .long   1077936128              # float 3
        .text
        .globl  main
        .p2align        4, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# BB#0:                                 # %entry
        pushq   %rbp
.Lcfi0:
        .cfi_def_cfa_offset 16
.Lcfi1:
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
.Lcfi2:
        .cfi_def_cfa_register %rbp
        movq    8(%rsi), %rdi
        callq   atof
        vcvtsd2ss       %xmm0, %xmm0, %xmm0
        vmulss  .LCPI0_0(%rip), %xmm0, %xmm0
        vcvtss2sd       %xmm0, %xmm0, %xmm0
        movl    $.L.str, %edi
        movb    $1, %al
        callq   printf
        xorl    %eax, %eax
        popq    %rbp
        retq
.Lfunc_end0:
        .size   main, .Lfunc_end0-main
        .cfi_endproc

        .type   .L.str,@object          # @.str
        .section        .rodata.str1.1,"aMS",@progbits,1
.L.str:
        .asciz  "a * b = %f\n"
        .size   .L.str, 12

        .ident  "FreeBSD clang version 4.0.0 (tags/RELEASE_400/final 297347) (based on LLVM 4.0.0)"
        .section        ".note.GNU-stack","",@progbits

As you can see, it outputs approximately the same code as you showed for the plain x86_64 case, and for Linux.

> The code that Rodney is referring to (libm) when compiled with this flag
> emits zero mulss instructions, which can't possibly optimal for an x86
> scalar math library. It also obliterates some of the vectorized versions of
> scalar functions, such as some of the hyperbolic atan functions.
As far as I know, our libm is very old, and geared mostly towards old school x87, as the maintainers (if we still have any :) seemed to have a great aversion against anything smelling of SSE. Maybe there are more modern maths libraries out there, which are optimized for recent CPUs, but this area is not my specialty.
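For readers without access to the attachment, the assembly above suggests a program of roughly the following shape. This is only a sketch inferred from the listing (an atof call, a double-to-float conversion, a multiply by the constant 3, and the "a * b = %f\n" format string); it is not the attached test_prog.c, and the function names scale_by_three and report are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a decimal string and scale it in single
 * precision. The comments name the instructions seen in the listings.
 */
float scale_by_three(const char *text)
{
    assert(text != NULL);
    float a = (float)atof(text);   /* cvtsd2ss, or vcvtsd2ss with AVX */
    float b = 3.0f;                /* the ".long 1077936128 # float 3" constant */
    return a * b;                  /* mulss, or vmulss with AVX */
}

/* Prints the product; the float is promoted to double (cvtss2sd) for printf. */
void report(const char *text)
{
    printf("a * b = %f\n", scale_by_three(text));
}
```

Calling report(argv[1]) from main and compiling once without -march and once with -march=sandybridge should give the two instruction mixes being compared, assuming the reconstruction is close to the attachment.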
(In reply to Dimitry Andric from comment #7)

Hmm, using GNU's objdump in Linux on this binary seems to show me correct instructions. I think FreeBSD's objdump may be struggling with VEX-encoded instructions? Seems like maybe libbfd is just too old.
Turns out the disassembler doesn't understand the VEX prefix for the SSE instructions. I'm not sure if FreeBSD uses GNU's objdump (I'd think not), but the latest version of objdump in GNU leverages a libbfd that understands these instructions. Considering we've been able to generate VEX instructions for a while now, it'd be nice to have a disassembler that recognized them.
Indeed, objdump in base is from GNU binutils 2.17.50, and is very old. It will not be updated, obviously. The version from the binutils port can display these instructions just fine.

Ed, it's not likely that elftoolchain will give us a complete objdump, right? Since we already build llvm-objdump, we might consider using that instead, although the formatting is slightly off:

Disassembly of section .text:
main:
       0:       55                              pushq   %rbp
       1:       48 89 e5                        movq    %rsp, %rbp
       4:       48 8b 7e 08                     movq    8(%rsi), %rdi
       8:       e8 00 00 00 00                  callq   0 <main+0xD>
       d:       c5 fb 5a c0                     vcvtsd2ss       %xmm0, %xmm0, %xmm0
      11:       c5 fa 59 05 00 00 00 00         vmulss  (%rip), %xmm0, %xmm0
      19:       c5 fa 5a c0                     vcvtss2sd       %xmm0, %xmm0, %xmm0
      1d:       bf 00 00 00 00                  movl    $0, %edi
      22:       b0 01                           movb    $1, %al
      24:       e8 00 00 00 00                  callq   0 <main+0x29>
      29:       31 c0                           xorl    %eax, %eax
      2b:       5d                              popq    %rbp
      2c:       c3                              retq
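The c5 fb and c5 fa byte pairs in these listings are two-byte VEX prefixes. A decoder that predates VEX reads 0xC5 as the legacy lds opcode, which matches the "lds (bad)" lines in the earlier objdump output. As a rough illustration of what a VEX-aware disassembler has to do with those two bytes, here is a minimal sketch; the struct vex2 and decode_vex2 names are made up for this example and do not come from any disassembler.

```c
#include <assert.h>
#include <stdint.h>

/* Fields carried by a two-byte VEX prefix: 0xC5 followed by one payload byte. */
struct vex2 {
    unsigned r;     /* REX.R equivalent; stored inverted in the encoding */
    unsigned vvvv;  /* extra source register number; stored inverted */
    unsigned l;     /* vector length: 0 = scalar or 128-bit, 1 = 256-bit */
    unsigned pp;    /* implied legacy prefix: 0 = none, 1 = 66, 2 = F3, 3 = F2 */
};

/* Hypothetical helper: split the payload byte into its four fields. */
struct vex2 decode_vex2(uint8_t first, uint8_t payload)
{
    assert(first == 0xC5);  /* only the two-byte form is handled here */

    struct vex2 v;
    v.r    = ((payload >> 7) & 1u) ^ 1u;     /* undo the inversion */
    v.vvvv = ((payload >> 3) & 0xFu) ^ 0xFu; /* undo the inversion */
    v.l    = (payload >> 2) & 1u;
    v.pp   = payload & 3u;
    return v;
}
```

For c5 fb the pp field comes out as 3 (an implied F2 prefix, as in the non-VEX cvtsd2ss encoding f2 0f 5a c0), and for c5 fa it comes out as 2 (an implied F3 prefix, as in mulss f3 0f 59). In both cases vvvv decodes to register 0, which matches the extra %xmm0 operand in the VEX forms.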
(In reply to Dimitry Andric from comment #10)
> Ed, it's not likely that elftoolchain will give us a complete objdump, right?
> Since we already build llvm-objdump, we might consider using that instead,
> although the formatting is slightly off:

Yes, I've been installing LLVM's llvm-objdump as /usr/bin/objdump in my staging tree (https://github.com/emaste/freebsd/commit/5d04768ffe7cf1ed695f8d1caa803f446c5ed110)

It's missing a few command-line options compared to GNU objdump, and the formatting is a bit different, but it's broadly compatible.

We currently use only three binutils tools:
1. as
2. ld
3. objdump

LLVM's LLD as a replacement for ld is coming along well. We don't yet have a viable as replacement, although at least for amd64 we can do without it (and just use the compiler driver).

So maybe we should introduce a conditional version (src.conf knob) of my patch above, with a plan of migrating to LLVM objdump by default in the future (and suggest that binutils from ports/packages be used in the cases where it is not sufficient).
(In reply to Rodney W. Grimes from comment #3)
> objdump -dSRC libm.so.5 indicates lots of legacy fpu instructions like
> fmuls instead of mulss as well as usage of the %st() registers

Note that is mostly because of the many long double functions in our libm, which cannot be natively implemented with SSE. For example, lrintf() and lrint() are compiled to:

0000000000005090 <lrintf>:
    5090:       f3 48 0f 2d c0          cvtss2si %xmm0,%rax
    5095:       c3                      retq

00000000000050a0 <lrint>:
    50a0:       f2 48 0f 2d c0          cvtsd2si %xmm0,%rax
    50a5:       c3                      retq

but lrintl() becomes:

00000000000050b0 <logbl>:
    50b0:       db 6c 24 08             fldt   0x8(%rsp)
    50b4:       d9 f4                   fxtract
    50b6:       dd d8                   fstp   %st(0)
    50b8:       c3                      retq
Eh, copy/paste error in that last comment, lrintl() is actually this:

0000000000005080 <lrintl>:
    5080:       db 6c 24 08             fldt   0x8(%rsp)
    5084:       48 83 ec 08             sub    $0x8,%rsp
    5088:       df 3c 24                fistpll (%rsp)
    508b:       58                      pop    %rax
    508c:       c3                      retq
Yeah I'd expect that to be the case for native extended precision. However it is most certainly an objdump issue rather than a clang issue.
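The extended-precision point can be checked from C. On amd64, long double is the 80-bit x87 extended format with a 64-bit significand, while the SSE scalar instructions only operate on float and double, so the long double routines have to go through the x87 unit. The sketch below only reports what the compiler's <float.h> says; the two function names are hypothetical.

```c
#include <assert.h>
#include <float.h>

/* The bit counts below are only meaningful for binary floating point. */
static_assert(FLT_RADIX == 2, "binary floating point assumed");

/* Nonzero when long double has the 64-bit significand of x87 extended precision. */
int long_double_is_x87_extended(void)
{
    return LDBL_MANT_DIG == 64;
}

/* Number of significand bits long double has beyond double (0 if they match). */
int extra_significand_bits(void)
{
    return LDBL_MANT_DIG - DBL_MANT_DIG;
}
```

On an amd64 FreeBSD system the first function would be expected to return nonzero and the second 11 (64 minus 53); other platforms define long double differently, so the values are not portable.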
I've got bitten by this when investigating https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226059 So, would we replace binutils objdump with LLVM one?
(In reply to arrowd from comment #16)
> I've got bitten by this when investigating
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226059
>
> So, would we replace binutils objdump with LLVM one?

Since r310840 we already have llvm-objdump in /usr/bin by default, but it isn't installed as the default objdump yet. There are likely still a few flag and output style differences, but maybe it's time to make the switch for 12.0. Ed, what do you think?
(In reply to Dimitry Andric from comment #17) I have a proposal to start installing llvm-objdump as /usr/bin/objdump in https://reviews.freebsd.org/D18307 now.
objdump 2.17.50 has now been removed for FreeBSD 13
Closing this, as this PR is not applicable anymore.