This system has done countless compiles, including several buildworlds, with no segfaults, but now that base clang has been updated to 4.0.0, the buildworld process segfaults. I have found numerous hits on Google about clang 4.0.0 instabilities. Data below:

/usr/local/bin/ccache clang++ -O2 -pipe -I/usr/src/11s/contrib/llvm/tools/lldb/include -I/usr/src/11s/contrib/llvm/tools/lldb/source -I/usr/src/11s/contrib/llvm/tools/lldb/source/Plugins/Process/FreeBSD -I/usr/src/11s/contrib/llvm/tools/lldb/source/Plugins/Process/POSIX -I/usr/src/11s/contrib/llvm/tools/lldb/source/Plugins/Process/Utility -I/usr/obj/usr/src/11s/lib/clang/libllvm -I/usr/obj/usr/src/11s/lib/clang/libclang -DLLDB_DISABLE_PYTHON -march=nehalem -I/usr/src/11s/contrib/llvm/tools/clang/include -DCLANG_ENABLE_ARCMT -DCLANG_ENABLE_STATIC_ANALYZER -I/usr/src/11s/lib/clang/include -I/usr/src/11s/contrib/llvm/include -DLLVM_ON_UNIX -DLLVM_ON_FREEBSD -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS -DNDEBUG -DLLVM_DEFAULT_TARGET_TRIPLE=\"x86_64-unknown-freebsd11.1\" -DLLVM_HOST_TRIPLE=\"x86_64-unknown-freebsd11.1\" -DDEFAULT_SYSROOT=\"\" -ffunction-sections -fdata-sections -MD -MF.depend.DataFormatters_TypeSummary.o -MTDataFormatters/TypeSummary.o -fstack-protector-strong -Qunused-arguments -Wno-deprecated -std=c++11 -fno-exceptions -fno-rtti -stdlib=libc++ -Wno-c++11-extensions -c /usr/src/11s/contrib/llvm/tools/lldb/source/DataFormatters/TypeSummary.cpp -o DataFormatters/TypeSummary.o
--- DataFormatters/TypeCategory.o ---
clang++: error: unable to execute command: Segmentation fault (core dumped)
clang++: error: clang frontend command failed due to signal (use -v to see invocation)
FreeBSD clang version 4.0.0 (tags/RELEASE_400/final 297347) (based on LLVM 4.0.0)
Target: x86_64-unknown-freebsd11.1
Thread model: posix
InstalledDir: /usr/bin
clang++: note: diagnostic msg: PLEASE submit a bug report to https://bugs.freebsd.org/submit/ and include the crash backtrace, preprocessed source, and associated run script.
clang++: note: diagnostic msg: ********************
PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang++: note: diagnostic msg: /tmp/TypeCategory-d16108.cpp
clang++: note: diagnostic msg: /tmp/TypeCategory-d16108.sh
clang++: note: diagnostic msg: ********************
*** [DataFormatters/TypeCategory.o] Error code 254

make[6]: stopped in /usr/src/11s/lib/clang/liblldb
1 error
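For anyone following along, the two files clang left in /tmp are a self-contained crash reproducer: the .sh script re-runs the exact failing frontend invocation against the preprocessed source, so the crash can be retried without the rest of the build tree. A sketch, using the file names clang printed above (adjust to match your own /tmp output):

```shell
# Re-run the failing frontend command against the preprocessed source.
# If the crash is deterministic, this segfaults again outside the build.
sh /tmp/TypeCategory-d16108.sh

# Bundle both files for attaching to the bug report (xz gives the best
# ratio for preprocessed C++; archive name is arbitrary):
tar -cJf TypeCategory-crash.txz \
    /tmp/TypeCategory-d16108.cpp /tmp/TypeCategory-d16108.sh
```
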
I cannot attach the two files; even when compressed at ultra settings, they are over the very small 1000 KB limit.
Google Drive has the debug data: https://drive.google.com/file/d/0B7P3Ne0hzKcGNEladGlEMXhtM2s/view?usp=sharing
(In reply to Chris Collins from comment #0)

Adding some notes on the following sorts of points might help the folks who look into this:

A) Are all the failures for compiling DataFormatters/TypeSummary.cpp?
B) Does DataFormatters/TypeSummary.cpp sometimes compile okay (no segmentation fault)?

If the answer to (A) is "no", then questions similar to (B) apply for the other failure points.

I'll note that I once had a context that was analogous to (A): no with (B): yes, and it turned out that I had access to a system whose memory settings were so poorly "optimized" that it was occasionally unreliable. Clang 4 seemed to show this notably more than 3.9 did, but it was not clang's problem. Of course I've no clue whether such might apply to your context, but it illustrates that such additional notes can be helpful.
I have considered that it could be memory or other I/O issues; the failure point is not always the same spot. I did read elsewhere, though, that clang 4.0 has a bug where it will segfault when it uses too much memory. I plan to test all this again on another machine, probably a VM I have running on a Xeon, which has ECC RAM and nothing overclocked. I will try to do this before the end of this upcoming weekend. Interestingly, the last two compile runs have succeeded, but they are mostly served from the ccache, so little if any actual compiling was done. This system is overclocked on the CPU. All the ports have been recompiled without segfaults using clang39 from ports, and there were no segfaults previously when base was 3.8. However, if I understand correctly, when I first built 11-STABLE it would have used a clang 4.0 compiled at the bootstrap stage to compile the rest of the world?
(In reply to Chris Collins from comment #4)

Chris wrote: "I did read elsewhere tho that clang 4.0 has a bug which when it uses too much memory it will segfault."

"Too much memory" probably means more than RAM+swap for the overall system activity, possibly from not limiting the build to -j1. -j1 avoids the RAM needed for parallel processes, trading off time. In some cases the OS might kill processes instead of just having memory allocation calls return failure.

I have used clang 4 for something that, on a machine with 16 GiBytes of RAM, required something like 10 GiBytes of swap as well before the compiles and links were able to finish. (This was actually an attempt to build what turned out to be a debug version of clang 4, where debug means far more than just -g use.) Once sufficient swap was present it completed: it needed more than 24 GiBytes of "memory space" overall --more than the machine had for RAM. (Note: lld and the system linker do not work well/fully for powerpc64 or powerpc, so this was using devel/powerpc64-binutils.) (I do not remember if this was for -j4 or -j1 on that old PowerMac G5 so-called "Quad Core".) I have never involved ccache.

Overall: the variability suggests either hardware unreliability and/or RAM+swap limitations for what you are attempting (given clang 4's and other tools' memory usage for the -j<?> in use, or analogous for ports).
An update:

I have done several memtester runs; all passed. With the CPU overclock removed, memtest86 passes, clang39 is stable, and clang40 segfaults randomly (all new tests with ccache disabled). I have swapped out memory sticks with the same result. I have also tried using less RAM, so fewer slots populated, and different slots. I will test on a physically different system this weekend, possibly Friday.

The system was originally FreeBSD 10, updated to FreeBSD 11.0, and then in the past week to FreeBSD 11-STABLE.

make.conf only has cputype?=nehalam for the CPU. src.conf is the following:

LOADER_ZFS_SUPPORT=YES
WITHOUT_GAMES=yes
WITHOUT_SENDMAIL=yes
WITHOUT_I4B=yes
WITHOUT_FLOPPY=yes
WITHOUT_PROFILE=yes
WITHOUT_IPFILTER=yes
WITHOUT_X11=yes
WITHOUT_BLUETOOTH=yes
WITHOUT_CVS=yes
WITHOUT_IPX=yes
WITHOUT_PPP=yes
WITHOUT_WIRELESS=yes
WITHOUT_CTM=yes
WITHOUT_LPR=yes
WITH_EXTRA_TCP_STACKS=yes
# BELOW FOR USERLAND DTRACE
WITH_CTF=yes

The system runs a ZFS mirror pool and has an SSD for a SLOG device. It's not a production server, so I am not bothered about downtime, and the non-redundant ZIL is also OK.
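A sketch of the test setup described above, for anyone reproducing it (package origins are the usual FreeBSD ports ones; the memtester size and pass count are illustrative, not what was necessarily used):

```shell
# Disable ccache for this session so every object is genuinely recompiled
# rather than served from the cache:
export CCACHE_DISABLE=1

# Userland RAM test (sysutils/memtester): lock and pattern-test 12 GB
# of the 16 GB fitted, 3 passes. Run as root so mlock() succeeds.
memtester 12G 3
```

memtester only exercises memory the kernel will give it, so a clean memtester run does not fully rule out bad RAM; memtest86 from a boot disk, as used above, covers more of the address space.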
The CPU is an i5 750, with 16 GB of RAM.
I cannot reproduce this crash with the sample you provided. I tried:

* clang 4.0.0 (297347) on FreeBSD 11.1-BETA1 i386 and amd64
* clang 4.0.0 (297347) on FreeBSD 12.0-CURRENT i386 and amd64
* clang 5.0.0 (305575) on FreeBSD 12.0-CURRENT i386 and amd64

It doesn't use a lot of memory either, roughly 250M max RSS:

        8.37 real         8.19 user         0.16 sys
    249616  maximum resident set size
     48201  average shared memory size
       268  average unshared data size
       249  average unshared stack size
     54447  page reclaims
      6410  page faults
         0  swaps
        32  block input operations
         5  block output operations
         0  messages sent
         0  messages received
         0  signals received
        20  voluntary context switches
       459  involuntary context switches

So memory starvation is pretty unlikely. I would suspect hardware issues in this case.
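Statistics like the ones above can be gathered with BSD time(1), whose -l flag reports maximum resident set size (in kilobytes on FreeBSD) alongside wall/user/sys time. A sketch, wrapping the crash-reproducer script from the earlier comment (file name as clang printed it; adjust to your own /tmp output):

```shell
# /usr/bin/time is the external BSD time(1), not the shell builtin;
# -l appends the rusage block shown above to the timing line.
/usr/bin/time -l sh /tmp/TypeCategory-d16108.sh
```
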
In the past I saw similar segfaults, and after all memory tests had passed successfully, I realised that the CPU temperature had risen dramatically and the dissipation capacity of the cooler had become insufficient. Since LLVM/clang 4.0.0 has been in the tree, I have seen a dramatic temperature increase on my Lenovo ThinkPad Edge E540, which is equipped with an Intel i5-4200M. Temperature is something I watch very carefully. This might be a coincidence, but I suspect that compiler developers try to use the facilities a CPU provides to speed up compilation, so performance is related to power consumption and therefore heat dissipation. On the other hand, I pulled off the CPU cooler and applied high-quality thermal grease, and that dropped the CPU temperature from ~81 degrees Celsius down to 66-72 degrees Celsius in the same ambient temperature and roughly the same OS revision (I did the grease application within one day and recompiled a complete world from scratch again).

So, to make it short: check the grease and the thermal conductivity of your CPU cooler. Thermal grease is not long-term stable, and the same goes for thermal pads: they get brittle and lose thermal conductivity over several years of use, and faster when the CPU is stressed by overclocking.
If overheating of the CPU is causing segfaults on a non-overclocked system, your CPU is already damaged. A stress test like Prime95 or IntelBurnTest should also reproduce the issue.
I have now tested on an old laptop (slow hardware, so a long wait). It has the exact same symptoms: stable when building 11.0 or 10.3 with the older clang, but once on 11-STABLE, random segfaults with clang 4.0.

I will still test on the server-class hardware at the weekend, but given the results of this search and my significant testing of replacement RAM etc., I think it's a clang 4.0 issue. Has FreeBSD historically changed compiler versions on a STABLE branch before, as it has now from 11.0 to 11.1?

Google search: "clang 4.0 segfault bug site:lists.llvm.org"
(In reply to Conrad Meyer from comment #10)

It has no issue with Prime95 stress tests or other stress tests. So to confirm: absolutely 100% stable with every piece of software on the system except a clang 4.0 buildworld. The CPU temperature is fine and well within spec.
This is with buildworld running:

root@test 11s # sysctl dev.cpu | grep temper
dev.cpu.3.temperature: 39.0C
dev.cpu.2.temperature: 40.0C
dev.cpu.1.temperature: 39.0C
dev.cpu.0.temperature: 40.0C

I will provide feedback Saturday or Sunday when I test on an ESXi instance; the host machine has ECC RAM and a new Xeon chip powering it, plus server-class storage.
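The readings above can be watched continuously during a build with a small loop; a sketch (the dev.cpu temperature sysctls come from coretemp(4), and the 10-second interval is arbitrary):

```shell
# Load the Intel on-die sensor driver if it is not already in the kernel
# (-n: do nothing quietly if it is already loaded):
kldload -n coretemp

# Poll per-core temperatures every 10 seconds; Ctrl-C to stop.
while :; do
    sysctl dev.cpu | grep temperature
    sleep 10
done
```
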
(In reply to Chris Collins from comment #11)

If this were a general problem, the build servers would not be able to build the releases, ports, and such.

I do buildworld buildkernel for head on amd64, powerpc64, aarch64, armv7, and powerpc, and I've not been having such problems. (I do cross builds amd64 -> <?> more than native, but do on occasion build natively for the others. My amd64 activity is under VirtualBox on either Windows 10 or macOS 10.12.5 at this point; the others are directly on the hardware that I have access to.) I build and run non-debug kernels normally despite running versions of head.

If what you report were generally happening to others, most FreeBSD activity that is clang 4 based would be largely "dead in the water" --but it is not. Almost certainly some property of your environment that is uncommon elsewhere is involved. The problem is isolating what that is.

It may be time for detailed kernel config specifications. As I remember, you already listed the src.conf that you use (comment 6). None of my src.conf content matches any of yours. I do not have any 11.x environments at this point, just head based, currently -r320192.

If you have a failing environment that can use a pure GENERIC kernel config and an empty src.conf (or some match to a well-established set of such files), you might want to try that. If it happens to work okay, then it would form the starting point of a search for what makes the difference. By contrast, if things still fail, this gets much harder to track down.

I can supply examples of my config files if needed, but I do not have defaults. (Just using clang 4 for targeting powerpc64 or powerpc is odd in the first place: I gather evidence of issues that I discover and report them, generally to llvm.) I do have a few source file differences associated with the experiments on non-amd64 --historically mostly tied to powerpc64 and powerpc.
(Note: Actually powerpc (32-bit) has problems with crashing even when sitting idle in my context, even if built with gcc 4.2.1. I've had crashes within minutes --or up to somewhat over 10 days 8 hours later. Usually it has been hours, but less than 9 hours. But use of clang need not be involved at all, so this is not a fit for your context. And no other of my environments has shown such behavior so far.)
(In reply to Mark Millard from comment #14)

My paragraph:

"If this were a general problem the build servers would not be able to build the releases, ports, and such."

was poorly chosen. I should have referred to just test builds that are based on head, stable/11, or the drafts of 11.1. (I expect that there have been many.) These likely start with projects/clang*-import/ testing and continue with head, stable/11, and the 11.1 drafts.

The official releases and such are likely still based on an older context building the newer context. I do not know whether they build a bootstrap clang 4 and then use it when the target is head, stable/11, or an 11.1 draft version of some kind. It could be that only the system compiler is built and installed, but not used for anything relative to buildworld buildkernel activity.

As I understand it, exp-runs were made for building ports based on clang 4. This might still be ongoing.

My own activity is incremental updates of head, so using clang 4 to build a bootstrap compiler that is clang 4 when needed, then using the resultant clang 4 either way. (I ignore here experimenting with devel/*xtoolchain* or using gcc 4.2.1 where I have to [32-bit powerpc kernel that finishes booting correctly].)

There is also likely activity of other people working based on clang 4, including buildworld, buildkernel, and building ports (ports that do not force some gcc or some other toolchain). I expect there is still enough activity based on clang 4 that my overall argument structure holds: it would be good to try something that matches a well-used, well-established build configuration overall and see what the status is for that configuration.

I'll note that my activity is mostly based on system clang, not devel/llvm40 clang, although I have attempted devel/xtoolchain-llvm40 for buildworld and buildkernel when there were unusual failures like missing routines in linking.
(So far system-clang and devel/xtoolchain-llvm40 have matched for such build issues. But I've rarely tried this.)
I am not insisting it's not hardware, and I continue to pursue the hardware route. I am about to go to bed as it is 4 am here, but I upped the vcore on my CPU and the DRAM voltage, and have done two buildworlds since with no segfaults. It is an old CPU, so it is possible voltage degradation has occurred to the point that stock voltage is no longer enough to be stable, which is why I have raised the voltage. I will start another buildworld now, which will be a third; if it succeeds, it will be the first time three have worked in a row. It is still on the GENERIC kernel as well, and I will also do more runs tomorrow with an empty src.conf. If these new runs all work (with increased voltage, and of course if all is also good on my Xeon), then yes, I accept that it is a hardware issue, and it is possible my old laptop may have similar issues, as it is old as well. :)
Perhaps buildworld with clang 4.0 is now the ultimate hardware stability test. :) The 3rd compile was fine; now running the 4th. I will still test on the server-class hardware this weekend. So it seems the diagnosis here is that clang 4.0 works the CPU harder, so it is more likely to expose stability problems than clang 3.x?
OK, a further update.

After a reboot, the i5 750 machine started getting segfaults again. A few reboots later, I have discovered the behaviour is fairly consistent: a roll of the dice occurs on each reboot. Usually, if the first buildworld has no problem, I can probably do 3+ in a row with no segfault, but if the first has a segfault, then I will struggle to get even one successful buildworld.

I discovered the LAPIC timer on my laptop is broken, aided by a warning on the console; when I switched it to i8254, the problem stopped. I then fresh-installed 11.0 again and discovered that 11.0 uses i8254 by default on that laptop, but 11-STABLE uses LAPIC. When LAPIC is used, I see some other odd behaviours, e.g. systat -v 1 will update really slowly. I then checked my i5 750: on 11.0 it uses LAPIC by default and seems to work OK; on 11-STABLE, LAPIC has the same issues as the laptop and it defaults to HPET.

At the time of this post I haven't tried a buildworld using a non-default timer, but I am running buildworld now using i8254 on the i5 750 to see what results I get; I will run it many times over multiple reboots. The VMware hypervisor has no segfault problems and uses LAPIC by default, working fine on 11.0 and 11-STABLE.

All the current tests are with an empty src.conf aside from 'LOADER_ZFS_SUPPORT=YES' and no CPUTYPE defined, to try to simplify the diagnosis.
So to confirm, as I don't think I wrote it clearly: using i8254 on my laptop, I don't get segfaults. The default timer changed between 11.0 and 11-STABLE.
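For reference, the event-timer inspection and switch described above can be done via sysctl; a sketch:

```shell
# List the available event timers (with their quality ratings) and show
# which one is currently in use:
sysctl kern.eventtimer.choice
sysctl kern.eventtimer.timer

# Switch to the i8254 at runtime (takes effect immediately):
sysctl kern.eventtimer.timer=i8254

# Make the choice persist across reboots:
echo 'kern.eventtimer.timer=i8254' >> /etc/sysctl.conf
```
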
(In reply to Chris Collins from comments #18 and #19)

Interesting --and non-obvious.

From what I've read, Message Signaled Interrupts (MSI, from PCI 2.2+) depend on the LAPIC, requiring the LAPIC to be enabled. If the LAPIC is not working correctly, then MSI might not work fully correctly either, and so should be avoided in such a context? (I'm not familiar with the details in this area; take the above as hearsay.)
No issues on the i5 750 now either, across 4 reboots and 13 buildworlds.

I may raise a new bug regarding the timers, as I also had to adjust the timecounter on my laptop to get C-states working; its default kept it in C1 all the time. So there seem to be weird eventtimer and timecounter issues on older hardware.

The VMware machine, which has no issues, is on a 2016 CPU. The i5 750 CPU was released in 2009. The laptop CPU is a Core 2 Duo T5750, released in 2008.

Thanks guys for your help.
(In reply to Mark Millard from comment #20)

Thanks. The laptop isn't using MSI-X or MSI anyway, so I am OK on that. I will have a look at the i5 750 dmesg to see whether MSI or MSI-X is used.
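One way to check that, as a sketch: pciconf(8) with -c lists each device's PCI capabilities, including MSI/MSI-X, and drivers that enabled MSI usually say so at attach time in dmesg.

```shell
# List devices with their capability blocks and show which advertise
# MSI or MSI-X (-l: list, -c: capabilities; -B3 keeps the device line):
pciconf -lc | grep -B3 -i msi

# Drivers that actually enabled MSI/MSI-X report it at attach:
dmesg | grep -i msi
```
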
This has been a very long time ago, and clang 4.0.0 is now long gone.