Bug 220184 - clang 4.0.0 segfaults on buildworld
Status: New
Product: Base System
Classification: Unclassified
Component: misc
Version: 11.0-STABLE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-toolchain mailing list
Keywords: regression
Reported: 2017-06-21 13:26 UTC by Chris Collins
Modified: 2017-06-26 15:06 UTC
Description Chris Collins 2017-06-21 13:26:12 UTC
This system has done countless compiles, including several buildworlds, with no segfaults, but since base clang was updated to 4.0.0 the buildworld process segfaults. I have found numerous hits on Google about clang 4.0.0 instabilities.

Data below

/usr/local/bin/ccache clang++  -O2 -pipe -I/usr/src/11s/contrib/llvm/tools/lldb/include -I/usr/src/11s/contrib/llvm/tools/lldb/source -I/usr/src/11s/contrib/llvm/tools/lldb/source/Plugins/Process/FreeBSD -I/usr/src/11s/contrib/llvm/tools/lldb/source/Plugins/Process/POSIX -I/usr/src/11s/contrib/llvm/tools/lldb/source/Plugins/Process/Utility -I/usr/obj/usr/src/11s/lib/clang/libllvm -I/usr/obj/usr/src/11s/lib/clang/libclang -DLLDB_DISABLE_PYTHON -march=nehalem -I/usr/src/11s/contrib/llvm/tools/clang/include -DCLANG_ENABLE_ARCMT -DCLANG_ENABLE_STATIC_ANALYZER -I/usr/src/11s/lib/clang/include -I/usr/src/11s/contrib/llvm/include -DLLVM_ON_UNIX -DLLVM_ON_FREEBSD -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS -DNDEBUG -DLLVM_DEFAULT_TARGET_TRIPLE=\"x86_64-unknown-freebsd11.1\" -DLLVM_HOST_TRIPLE=\"x86_64-unknown-freebsd11.1\" -DDEFAULT_SYSROOT=\"\" -ffunction-sections -fdata-sections -MD -MF.depend.DataFormatters_TypeSummary.o -MTDataFormatters/TypeSummary.o -fstack-protector-strong -Qunused-arguments -Wno-deprecated  -std=c++11 -fno-exceptions -fno-rtti -stdlib=libc++ -Wno-c++11-extensions  -c /usr/src/11s/contrib/llvm/tools/lldb/source/DataFormatters/TypeSummary.cpp -o DataFormatters/TypeSummary.o
--- DataFormatters/TypeCategory.o ---
clang++: error: unable to execute command: Segmentation fault (core dumped)
clang++: error: clang frontend command failed due to signal (use -v to see invocation)
FreeBSD clang version 4.0.0 (tags/RELEASE_400/final 297347) (based on LLVM 4.0.0)
Target: x86_64-unknown-freebsd11.1
Thread model: posix
InstalledDir: /usr/bin
clang++: note: diagnostic msg: PLEASE submit a bug report to https://bugs.freebsd.org/submit/ and include the crash backtrace, preprocessed source, and associated run script.
clang++: note: diagnostic msg: 
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang++: note: diagnostic msg: /tmp/TypeCategory-d16108.cpp
clang++: note: diagnostic msg: /tmp/TypeCategory-d16108.sh
clang++: note: diagnostic msg: 

********************
*** [DataFormatters/TypeCategory.o] Error code 254

make[6]: stopped in /usr/src/11s/lib/clang/liblldb
1 error
Comment 1 Chris Collins 2017-06-21 13:27:05 UTC
I cannot attach the 2 files; even when compressed at ultra settings they are over the very small 1000 KB limit.
Comment 2 Chris Collins 2017-06-21 13:33:37 UTC
Google Drive has the debug data:

https://drive.google.com/file/d/0B7P3Ne0hzKcGNEladGlEMXhtM2s/view?usp=sharing
Comment 3 Mark Millard 2017-06-21 23:18:53 UTC
(In reply to Chris Collins from comment #0)

If you know the answers, adding some notes about
the following sorts of points might help folks who
look into this:

A) Are all the failures for compiling
   DataFormatters/TypeSummary.cpp ?

B) Does DataFormatters/TypeSummary.cpp
   sometimes compile okay (no Segmentation
   Fault)?

If the answer to (A) is "no" then questions
similar to (B) apply for the other failure
points.

I'll note that I had a context that was
analogous to (A): no with (B): yes and it
turned out that I had access to a system
that had its memory settings poorly
"optimized" (to the point of occasional
unreliability). Clang 4 seemed to show
this notably more than 3.9 did but it
was not clang's problem.

Of course I've no clue if such might apply
to your context. But it illustrates that
such additional notes can be helpful.
Comment 4 Chris Collins 2017-06-22 00:04:37 UTC
I have considered it could be memory or other i/o issues.  The fail point is not always the same spot.

I did read elsewhere tho that clang 4.0 has a bug which when it uses too much memory it will segfault.

I plan to test all this again on another machine, probably a VM I have running on a Xeon which has ECC RAM and nothing overclocked.

I will try to do this before the end of this upcoming weekend.

Interestingly, the last 2 compile runs have succeeded, but they were mostly served from the ccache cache, so not much, if any, actual compiling was done.

The CPU on this system is overclocked.

All the ports have been recompiled without segfaults using clang39 from ports, and no previous segfaults when base was 3.8.

However if I understand correctly when I first built 11-STABLE, it would have used clang40 compiled at the bootstrap stage to compile the rest of the world?
Comment 5 Mark Millard 2017-06-22 00:40:44 UTC
(In reply to Chris Collins from comment #4)

Chris wrote:

I did read elsewhere tho that clang 4.0 has a bug which when it uses too much memory it will segfault.


"too much memory": probably meaning more than
RAM+swap for the overall system activity, possibly
because the build was not limited to -j1. -j1 avoids
using RAM for parallel processes, at the cost of time.



In some cases the OS might kill processes
instead of just having memory allocation calls
return failure.

I have used clang 4 for something that on a
machine with 16 GiBytes of RAM required
something like 10 GiBytes of swap as well
before the compiles and links were able to
finish. (This was actually an attempt to
build what turned out to be a debug version
of clang 4, where debug means far more than
just -g use.)

Once sufficient swap was present it completed:
it needed more than 24 GiBytes of "memory
space" overall --more than the machine had
for RAM. (Note: lld and the system linker do
not work well/fully for powerpc64 or powerpc
so this was using devel/powerpc64-binutils .)

(I do not remember if this was for -j4 or -j1
on that old PowerMac G5 so-called "Quad Core".)

I have never involved ccache.



Overall: the variability suggests either
hardware unreliability and/or RAM+swap
limitations for what you are attempting
(given clang 4's and other tools' memory
usage for the -j<?> in use or analogous
for ports).
Comment 6 Chris Collins 2017-06-22 14:15:54 UTC
An update

I have done several memtester runs; all pass.
Removed the CPU overclock.
memtest86 passes.
clang39 is stable; clang40 segfaults at random (all new tests with ccache disabled).
I have swapped out memory sticks; same result.
I have tried using less RAM, so fewer slots populated, and different slots.

I will test on a physically different system this weekend, possibly Friday.

The system was originally FreeBSD 10, updated to FreeBSD 11.0, and then in the past week to FreeBSD11-STABLE.

make.conf only has CPUTYPE?=nehalem for the CPU.

src.conf is as follows:


LOADER_ZFS_SUPPORT=YES
WITHOUT_GAMES=yes
WITHOUT_SENDMAIL=yes
WITHOUT_I4B=yes
WITHOUT_FLOPPY=yes
WITHOUT_PROFILE=yes
WITHOUT_IPFILTER=yes
WITHOUT_X11=yes
WITHOUT_BLUETOOTH=yes
WITHOUT_CVS=yes
WITHOUT_IPX=yes
WITHOUT_PPP=yes
WITHOUT_WIRELESS=yes
WITHOUT_CTM=yes
WITHOUT_LPR=yes
WITH_EXTRA_TCP_STACKS=yes

# BELOW FOR USERLAND DTRACE
WITH_CTF=yes

The system runs a ZFS mirror pool and has an SSD as a SLOG device.

It's not a production server, so I am not bothered about downtime, and a non-redundant ZIL is OK.
Comment 7 Chris Collins 2017-06-22 14:17:43 UTC
cpu is i5 750
16 gig of ram
Comment 8 Dimitry Andric freebsd_committer 2017-06-22 17:47:02 UTC
I cannot reproduce this crash with the sample you provided.  I tried:
* clang 4.0.0 (297347) on FreeBSD 11.1-BETA1 i386 and amd64
* clang 4.0.0 (297347) on FreeBSD 12.0-CURRENT i386 and amd64
* clang 5.0.0 (305575) on FreeBSD 12.0-CURRENT i386 and amd64.

It doesn't use a lot of memory either, roughly 250M max RSS:

        8.37 real         8.19 user         0.16 sys
    249616  maximum resident set size
     48201  average shared memory size
       268  average unshared data size
       249  average unshared stack size
     54447  page reclaims
      6410  page faults
         0  swaps
        32  block input operations
         5  block output operations
         0  messages sent
         0  messages received
         0  signals received
        20  voluntary context switches
       459  involuntary context switches

So memory starvation is pretty unlikely.  I would suspect hardware issues, in this case.
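
The resource figures above have the shape of `/usr/bin/time -l` output on FreeBSD. A minimal sketch for repeating that measurement on the affected machine, assuming the reproduction script clang wrote in comment 0 is still present; `max_rss_kb` is a hypothetical helper name, not something from this report:

```shell
# max_rss_kb (hypothetical helper): read `time -l` style output on stdin
# and print the "maximum resident set size" value (kilobytes on FreeBSD).
max_rss_kb() {
    awk '/maximum resident set size/ { print $1 }'
}

# Assumed usage (not run here). time(1) writes its report to stderr,
# hence the 2>&1 before the pipe:
#   /usr/bin/time -l sh /tmp/TypeCategory-d16108.sh 2>&1 | max_rss_kb
```

Comparing the printed value against free RAM+swap gives a rough check on whether memory exhaustion is plausible for a single compile.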
Comment 9 O. Hartmann 2017-06-22 18:26:41 UTC
In the past I saw similar segfaults, and after all memory tests had passed successfully, I realised that the CPU temperature rose dramatically and the dissipation capacity of the cooler was insufficient.

Since LLVM/Clang 4.0.0 has been in the tree, I have noticed a dramatic temperature increase on my Lenovo ThinkPad Edge E540, which is equipped with an Intel i5-4200M. The temperature is something I observe very carefully. This might be a coincidence, but I imagine that compiler developers try to use the facilities a CPU provides to speed up compilation, so performance is tied to power consumption and therefore heat dissipation.

On the other hand, I ripped off the CPU cooler and applied high quality thermal grease, and that dropped the CPU temperature from ~81 degrees Celsius down to 66-72 degrees Celsius at the same ambient temperature and roughly the same OS revision (I did the grease application within one day and recompiled a complete world from scratch, again).

So, to make it short: check the grease and thermal conductivity of your CPU cooler. Thermal grease is not long-term stable, and the same goes for thermal pads. They get brittle and lose thermal conductivity over several years of use, and faster when the CPU is stressed by overclocking.
Comment 10 Conrad Meyer freebsd_committer 2017-06-22 19:32:12 UTC
If overheating of the CPU is causing segfaults (non-overclocked), your CPU is already damaged.  Some stress test like Prime95 or IntelBurnTest should also reproduce the issue.
Comment 11 Chris Collins 2017-06-22 22:38:16 UTC
Have now tested on an old laptop (slow hardware so long waiting time)

It has the exact same symptoms.

Stable when building 11.0 or 10.3 on older clang.

Once on 11-STABLE, random segfaults on clang 4.0

Will test on the server class hardware at the weekend, but given the results of the search below and my significant testing of replacement RAM etc., I think it's a clang 4.0 issue.

Has FreeBSD historically changed compiler version on a STABLE branch before, like it has from 11.0 to 11.1 now?

google search "clang 4.0 segfault bug site:lists.llvm.org"
Comment 12 Chris Collins 2017-06-22 22:39:44 UTC
(In reply to Conrad Meyer from comment #10)

It has no issue with prime95 stress tests and other stress tests.

So to confirm: the system is absolutely 100% stable with every piece of software except the clang 4.0 buildworld.

The CPU temperature is fine and well within spec.
Comment 13 Chris Collins 2017-06-22 22:42:55 UTC
This is with buildworld running

root@test 11s # sysctl dev.cpu |grep temper  
dev.cpu.3.temperature: 39.0C
dev.cpu.2.temperature: 40.0C
dev.cpu.1.temperature: 39.0C
dev.cpu.0.temperature: 40.0C
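
A single reading can miss a short spike during a long build. As a rough sketch, the per-core lines above can be reduced to the hottest core so the value is easy to log repeatedly; `max_cpu_temp` is a hypothetical helper and assumes the `dev.cpu.N.temperature: 39.0C` format shown in this comment:

```shell
# max_cpu_temp (hypothetical helper): read "dev.cpu.N.temperature: 39.0C"
# lines on stdin and print the highest reading in degrees Celsius.
max_cpu_temp() {
    awk -F': ' '/temperature/ {
        t = $2
        sub(/C$/, "", t)                      # strip the trailing unit
        if (n == 0 || t + 0 > max) max = t + 0
        n++
    }
    END { if (n > 0) printf "%.1f\n", max }'
}

# Assumed usage while buildworld runs (not run here):
#   while sleep 10; do sysctl dev.cpu | grep temper | max_cpu_temp; done
```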

I will provide feedback Saturday or Sunday when I test on an ESXi instance; the host machine has ECC RAM and a new Xeon chip powering it. Also server class storage.
Comment 14 Mark Millard 2017-06-22 23:27:06 UTC
(In reply to Chris Collins from comment #11)

If this were a general problem the build servers
would not be able to build the releases, ports,
and such.

I do buildworld buildkernel for head on amd64, powerpc64,
aarch64, armv7, and powerpc. I've not been having such
problems. (I do cross builds amd64 -> <?> more than
native but do on occasion build native for the others.
My amd64 activity is under virtual box on either Windows
10 or macOS 10.12.5 at this point. The others are
directly on the hardware that I have access to.) I
build and run non-debug kernels normally despite running
versions of head.

If what you report was generally happening to others
most FreeBSD activity that is clang 4 based would be
largely "dead in the water" --but it is not. Almost
certainly some uncommon property in other environments
is a property of your environment and is involved. The
problem is isolating what is involved.

It may be time for detailed kernel config specifications.
As I remember you already listed the src.conf that you
use (comment 6). None of my src.conf content matches any
of yours. I do not have any 11.x environments at this point,
just head based, currently -r320192 .

If you have a failing environment that can use a pure
GENERIC kernel config and an empty src.conf (or some
match to a well established set of such files), you
might want to try such. If it happens to work okay
then it would form the starting point of a search
for what makes the difference. By contrast if things
still fail this gets much harder to track down.
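
Because the failures are intermittent, one clean run proves little. A minimal sketch of a repeat harness, assuming a POSIX sh; `count_failures` is a hypothetical helper, and the `SRCCONF=/dev/null __MAKE_CONF=/dev/null` overrides in the usage line are my assumption of how to get an empty src.conf and make.conf for the build:

```shell
# count_failures N CMD... (hypothetical helper): run CMD N times and print
# how many runs exited non-zero. Output is appended to $LOG if it is set,
# otherwise discarded.
count_failures() {
    cf_runs=$1
    shift
    cf_failed=0
    cf_i=0
    while [ "$cf_i" -lt "$cf_runs" ]; do
        "$@" >>"${LOG:-/dev/null}" 2>&1 || cf_failed=$((cf_failed + 1))
        cf_i=$((cf_i + 1))
    done
    echo "$cf_failed"
}

# Assumed usage (not run here), with the source path from comment 0:
#   LOG=/tmp/bw.log count_failures 5 make -C /usr/src/11s buildworld \
#       SRCCONF=/dev/null __MAKE_CONF=/dev/null
```

Recording the failure count per configuration makes it easier to compare the stripped-down setup against the original one.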

I can supply examples of my config files if needed
but I do not have defaults. (Just using clang 4 for
targeting powerpc64 or for powerpc is odd in the
first place: I gather evidence of issues that I
discover and report them, generally to llvm.) I do
have a few source file differences associated with
the experiments on non-amd64 --historically mostly
tied to powerpc64 and powerpc.

(Note: Actually powerpc (32-bit) has problems with
crashing even when sitting idle in my context, even
if built with gcc 4.2.1. I've had crashes in minutes
--or up to somewhat over 10 days 8 hours later. Usually
it has been hours but less than 9 hours. But use of
clang need not be involved at all for this so it
is not a fit to your context. And no other of my
environments has shown such behavior so far.)
Comment 15 Mark Millard 2017-06-23 01:01:33 UTC
(In reply to Mark Millard from comment #14)

My paragraph:

"If this were a general problem the build servers
would not be able to build the releases, ports,
and such."

was poorly chosen. I should have referred to just
test builds that are based on head, stable/11,
or the drafts of 11.1 . (I expect that there have
been many.) These likely start with
projects/clang*-import/ testing and continue with
head, stable/11, and the 11.1 drafts.

The official builds of releases and such likely are still
based on an older context building the newer
context. I do not know if they build and use a
bootstrap clang 4 and then use it or not when the
target is head, stable/11, or an 11.1 draft version
of some kind. It could be that only the system
compiler is built and installed but not used for
anything relative to buildworld buildkernel activity.

As I understand exp-runs were made for building
ports that were based on clang 4. This might
still be on-going.

My own activity is incremental updates of head,
so using clang 4 to build a bootstrap compiler
that is clang 4 when needed. Then using the
resultant clang 4 either way. (I ignore here
experimenting with devel/*xtoolchain* or using
gcc 4.2.1 where I have to [32-bit powerpc
kernel that finishes booting correctly].)

There is also likely activity of other people
working based on clang 4, including buildworld,
buildkernel, and building ports (ports that do
not force some gcc or some other toolchain).

I expect there is still enough activity based
on clang 4 that my overall argument structure
still holds: It would be good to try something
that matches a well used, well established
build configuration overall and see what
the status is for that build configuration.

I'll note that my activity is mostly based on
system-clang, not devel/llvm40 clang. Although
I have attempted devel/xtoolchain-llvm40 for
buildworld and buildkernel when there were
unusual failures like missing routines in
linking. (So far system-clang and
devel/xtoolchain-llvm40 have matched for such
build issues. But I've rarely tried this.)
Comment 16 Chris Collins 2017-06-23 03:19:15 UTC
I am not insisting it's not hardware, and I continue to pursue the hardware route.

I am about to go to bed as it is 4am here, but I upped the vcore on my CPU and the DRAM voltage on the system and have done 2 buildworlds since with no segfaults. It is an old CPU, so it is possible that degradation has occurred to the point that stock voltage is no longer enough to be stable, which is why I have raised the voltage.

I will start another buildworld now which will be a third, if it succeeds it will be the first time 3 have worked in a row.

It is still on the GENERIC kernel as well.

I will also do more runs tomorrow with an empty src.conf.

If these new runs all work (with increased voltage, and of course if it is also good on my Xeon), then yes, I accept that as a hardware issue, and it is possible my old laptop has similar issues as that is old as well. :)
Comment 17 Chris Collins 2017-06-23 15:01:16 UTC
Perhaps buildworld with clang 4.0 is now the ultimate hardware stability test :)

3rd compile was fine, now running 4th.

Will still test on the server class hardware this weekend.

So it seems the diagnosis here is that clang 4.0 works the cpu harder so it is more likely to show up stability problems than clang 3.x?
Comment 18 Chris Collins 2017-06-25 21:13:58 UTC
Ok a further update.

After a reboot, the i5 750 machine started getting segfaults again. A few reboots later I have discovered the behaviour is fairly consistent, where a rolld o the dice occurs on a reboot: usually if the first buildworld has no problem I can probably do 3+ in a row with no segfault, but if the first has a segfault then I will struggle to get just one successful buildworld.

I discovered the LAPIC timer on my laptop is broken, aided by a warning on the console; when I switched it to i8254 the problem stopped. I then fresh installed 11.0 again and discovered that 11.0 uses i8254 by default but 11-STABLE uses LAPIC. When LAPIC is used I see some other odd behaviours, e.g. systat -v 1 will update really slowly.

I then checked my i5 750: on 11.0 it uses LAPIC by default and seems to work OK; on 11-STABLE LAPIC has the same issues as on the laptop, and it defaults to HPET. At the time of this post I haven't tried a buildworld using a non-default timer, but I am running buildworld now using i8254 on the i5 750 to see what results I get. I will run it many times over multiple reboots.

The VMware hypervisor has no segfault problems and uses LAPIC by default, working fine on 11.0 and 11-STABLE.

All the current tests are with an empty src.conf aside from 'LOADER_ZFS_SUPPORT=YES' and no CPUTYPE defined, to try and simplify the diagnosis.
Comment 19 Chris Collins 2017-06-25 21:15:34 UTC
So to confirm, as I don't think I wrote it well: using i8254 on my laptop I don't get segfaults. The default timer changed between 11.0 and 11-STABLE.

I also meant "roll of the dice" but typod.
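
For anyone wanting to try the same timer switch, a minimal sketch follows. It assumes the `kern.eventtimer.choice` and `kern.eventtimer.timer` sysctls and the `NAME(priority)` list format as I understand them from eventtimers(4); `has_timer` is a hypothetical helper, and the sample priorities in its comment are illustrative only:

```shell
# has_timer NAME (hypothetical helper): succeed if NAME is listed in a line
# such as "kern.eventtimer.choice: HPET(450) LAPIC(400) i8254(100) RTC(0)"
# read from stdin.
has_timer() {
    tr ' ' '\n' | grep -q "^$1("
}

# Assumed usage on FreeBSD, as root (not run here):
#   sysctl kern.eventtimer.timer                 # timer currently in use
#   sysctl kern.eventtimer.choice | has_timer i8254 \
#       && sysctl kern.eventtimer.timer=i8254    # switch at runtime
# To keep the setting across reboots, a line in /etc/sysctl.conf (assumed):
#   kern.eventtimer.timer=i8254
```

Checking the choice list first avoids asking for a timer the kernel did not detect on that hardware.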
Comment 20 Mark Millard 2017-06-25 21:52:02 UTC
(In reply to Chris Collins from comments #18 and #19)

Interesting --and non-obvious.

From what I've read Message Signaled Interrupts (MSI)
from PCI 2.2+ depend on LAPIC, requiring LAPIC to be
enabled.

If LAPIC is not working correctly then MSI might not
work fully correctly either and so should be avoided
in such a context?

(I'm not familiar with the details in this area. Take
the above as hearsay.)
Comment 21 Chris Collins 2017-06-26 15:01:38 UTC
No issues on the i5 750 now as well across 4 reboots and 13 buildworlds.

I may raise a new bug regarding the timers, as I also had to adjust the timecounter on my laptop to get C-states working; its default kept it in C1 all the time. So there seem to be weird eventtimer and timecounter issues on older hardware.

The VMware machine, which has no issues, has a 2016 CPU.
The i5 750 CPU was released in 2009.
The laptop CPU is a Core 2 Duo T5750 released in 2008.

Thanks guys for your help.
Comment 22 Chris Collins 2017-06-26 15:06:17 UTC
(In reply to Mark Millard from comment #20)

Thanks, the laptop isn't using MSI-X or MSI anyway, so I am OK on that. I will have a look at the i5 750 dmesg to see if MSI or MSI-X is used.