Bug 204671 - clang floating point wrong around Inf (i386)
Summary: clang floating point wrong around Inf (i386)
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: 10.2-RELEASE
Hardware: i386 Any
: --- Affects Some People
Assignee: freebsd-bugs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-11-19 06:41 UTC by Valentin Nechayev
Modified: 2015-11-27 14:37 UTC (History)
2 users (show)

See Also:


Attachments
source files (736 bytes, application/gzip)
2015-11-19 06:41 UTC, Valentin Nechayev
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Valentin Nechayev 2015-11-19 06:41:50 UTC
Created attachment 163320 [details]
source files

The test program, being called as "./t <a> <operator> <b> <rounding>",
performs a single arithmetic operator with the specified rounding and
prints its results. In some cases, output is wrong.

Conditions to reproduce:
1. Clang of any available version (confirmed on 3.4 from base,
clang36-3.6.2, clang37-3.7.2 from ports). I can't get this issue with
gcc-4.8.5, gcc-5.2.0_1 from ports.
2. i386 (amd64 isn't affected, I guess, because the issue is bound to FPU
variant).
3. no high -march= ("native" causes issues to disappear, I guess, for
the same connection to FPU; clang starts emitting SSE for this CPU).
4. -O or higher optimization level (-O0 isn't affected).

The OS is: FreeBSD 10.2-RELEASE-p7 i386.
The CPU on the test machine is: AMD Athlon(tm) 64 Processor 3500+
(Origin="AuthenticAMD"  Id=0x50ff2  Family=0xf  Model=0x5f  Stepping=2).

The proper results are (as I see from available IEEE754 documents):

$ ./t1 1e308 + 1e308 0
r=inf                       ( 7F F0 00 00 00 00 00 00)
$ ./t1 1e308 + 1e308 1
r=1.797693134862316e+308    ( 7F EF FF FF FF FF FF FF)
$ ./t1 1e308 + 1e308 2
r=inf                       ( 7F F0 00 00 00 00 00 00)
$ ./t1 1e308 + 1e308 3
r=1.797693134862316e+308    ( 7F EF FF FF FF FF FF FF)

This satisties the standard requirement that, e.g., "roundTowardZero,
the result shall be the format's floating-point number closest to and no
greater in magnitude than the infinitely precise result."

The variant with t1.c from attachment when the issue is exposed
(compiled as "cc -o t1 t1.c -g -Wall -W -lm -O"):

$ ./t1 1e308 + 1e308 0
r=inf                       ( 7F F0 00 00 00 00 00 00)
$ ./t1 1e308 + 1e308 1
r=inf                       ( 7F EF FF FF FF FF FF FF)
$ ./t1 1e308 + 1e308 2
r=inf                       ( 7F F0 00 00 00 00 00 00)
$ ./t1 1e308 + 1e308 3
r=inf                       ( 7F EF FF FF FF FF FF FF)

So, the binary representation of result is correct, but the printf
output is not.

The same compilation with -DNO_HEX always prints "inf" (so, it rejects a
guess of an aliasing issue):

$ ./t1 1e308 + 1e308 0
r=inf                       ()
$ ./t1 1e308 + 1e308 1
r=inf                       ()
$ ./t1 1e308 + 1e308 2
r=inf                       ()
$ ./t1 1e308 + 1e308 3
r=inf                       ()

The variant in t2.c uses global union instead of local on-stack
one for binary printing. The behavior differs so binary representation
always shows "inf":

$ ./t2 1e308 + 1e308 0
r=inf                       ( 7F F0 00 00 00 00 00 00)
$ ./t2 1e308 + 1e308 1
r=inf                       ( 7F F0 00 00 00 00 00 00)
$ ./t2 1e308 + 1e308 2
r=inf                       ( 7F F0 00 00 00 00 00 00)
$ ./t2 1e308 + 1e308 3
r=inf                       ( 7F F0 00 00 00 00 00 00)

Again, adding -DNO_HEX causes "inf" still printed in all cases.

But: a variant with "r" declared as global variable instead of local one
(-DR_GLOBAL for both source versions) stops the issue.
Comment 1 Valentin Nechayev 2015-11-20 10:15:02 UTC
Not reproduced on Kubuntu/i386 14.04, clang-3.6 from packages, AMD FX-8150 => seems FreeBSD specific.
Comment 2 Jilles Tjoelker freebsd_committer 2015-11-21 20:37:24 UTC
This is related to the strangeness that is the x87 FPU. Internally, the x87 performs calculations in extended precision. Even if the precision control is set to double precision, like FreeBSD and Windows do by default but Linux and Solaris do not, the x87 registers still have greater range than double precision.

As a result, the addition 1e308 + 1e308 does not overflow, but produces a result of approximately 2e308 in an x87 register. When this result is stored to memory in double precision format, overflow or rounding will occur.

What happens in t1.c is that the conversion from extended to double precision happens two times. The conversion for printing the bytes happens directly after the calculation and therefore uses the modified rounding mode. The conversion for printf happens during the inlined fesetround() call, after setting the x87 rounding mode and before calling a function __test_sse to check whether SSE is available. (After that, the value is stored and loaded again a few times.) Therefore, the conversion for printf uses an incorrect rounding mode.

Global variables force the compiler to store values to memory more often and may therefore reduce x87 weirdnesses.

Following the C standard, you would have to use  #pragma STDC FENV_ACCESS on  to make this work reliably. However, neither gcc nor clang support this pragma. They follow an ad hoc approach to floating point exceptions and modes. In gcc you can use -frounding-math to prevent some problematic optimizations but clang doesn't even support that. Clang has a bug about the pragma, https://llvm.org/bugs/show_bug.cgi?id=8100, which has been open for five years with various duplicates but no other significant action.

You will generally have fewer problems with weirdly changing floating point results if you use SSE instead of the x87 FPU, assuming your CPUs are new enough. SSE performs calculations in the precision specified by the program (single or double), so it does not matter when or if a value is spilled to memory. As noted above, GCC and clang are still ignorant about the side effects with the floating point exceptions and modes, though.
Comment 3 Valentin Nechayev 2015-11-22 09:11:48 UTC
(In reply to Jilles Tjoelker from comment #2)

Jilles, thanks for the excellent explanation. This exposes I have lost some important advances in floating point (like FENV_ACCESS role and need). But,

> The conversion for printf happens during the inlined fesetround() call, after setting the x87 rounding mode and before calling a function __test_sse to check whether SSE is available.

Isn't this the issue by itself? If fesetround() makes an action which generally shall be atomic, no intervention must be allowed during this setting. If it can't be explained in inlined version using C, either "asm volatile" should be used, or a fully separate function.

> You will generally have fewer problems with weirdly changing floating point results if you use SSE instead of the x87 FPU, assuming your CPUs are new enough.

Yep, SSE is better in all senses, if supported and exploited. But the latter is a separate issue. Default compiler installation for the i386 target still uses the least possible CPU (as far as I see from compilation without any options in make.conf). Old style option update (CFLAGS?= in make.conf) doesn't work anymore. With the current install base, I'd prefer to see an option in installer which suggests something like "-march=nocona -mtune=native" for local builds. (This also hints at the very old topic with having a subtarget for binary builds for modern processors, since 99% of them are at least P3, and deliver them for freebsd-update... but this is definitely not the current ticket issue...) For this particular installation, I had neither strong reason nor inspiration to convert it to 64-bit one, so still are many users. So, make.conf will be loaded with weird constructs like

NO_CPU_CFLAGS=true
NO_CPU_COPTFLAGS=true
.if ${.CURDIR:N*/BSD/src/*} == ""
CFLAGS+= -march=nocona -mtune=k8 -mmmx -msse -msse2
COPTFLAGS+= -march=i686 -mtune=k8
.endif

> Clang has a bug about the pragma, https://llvm.org/bugs/show_bug.cgi?id=8100, which has been open for five years with various duplicates but no other significant action.

As soon as they rely on GCC frontend, I doubt this will be fixed until GCC guys implement its support on their side.
Comment 4 Jilles Tjoelker freebsd_committer 2015-11-27 14:37:21 UTC
It may be reasonable to make i386 fesetround() a non-inline function, at least when compiling without SSE (__test_sse() may be called). In that case, compilers are likely to save caller-save registers already, so part of the cost of a function call is already paid, even though an actual function call only happens the first time.

Alternatively, __test_sse() could be called somewhere during startup, so the function call in the inlined fesetround() is not needed. This will reduce code size of fesetround() calls considerably, but rounding to float or double is still likely to use an incorrect rounding mode.

When compiling with SSE, there is no __test_sse() call and the behaviour for the x87 is similar; however, if SSE2 is enabled, the x87 is probably only used for long double.