Bug 221029 - AMD Ryzen: strange compilation failures using poudriere or plain buildkernel/buildworld
Summary: AMD Ryzen: strange compilation failures using poudriere or plain buildkernel/buildworld
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 11.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs mailing list
URL:
Keywords:
Depends on: 219399
Blocks:
Reported: 2017-07-26 17:27 UTC by Nils Beyer
Modified: 2019-07-09 22:55 UTC (History)
6 users

See Also:


Attachments
ryzen_stress_test script - executes buildkernel/buildworld in TMPFS endlessly. Should generate "unable to rename temporary" errors at some point... (566 bytes, application/x-sh)
2017-07-26 17:30 UTC, Nils Beyer
logs of failed poudriere builds (238.52 KB, application/x-xz)
2017-07-26 17:44 UTC, Nils Beyer

Description Nils Beyer 2017-07-26 17:27:18 UTC
Hi,

with reference to:

    https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399

on an AMD Ryzen system there are strange compilation errors consisting of

    - bus errors
    - segmentation faults
    - unable to rename temporary 'XXX' to 'YYY' using TMPFS

when running poudriere bulk builds or plain buildkernel/buildworld.

These compilations will succeed with a high probability if you start them again.

The root cause is still unknown. Related threads in AMD's forum and elsewhere can be found here:

    https://community.amd.com/thread/215773
    https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads
    https://forums.gentoo.org/viewtopic-t-1061546.html
Comment 1 Nils Beyer 2017-07-26 17:30:33 UTC
Created attachment 184743 [details]
ryzen_stress_test script - executes buildkernel/buildworld in TMPFS endlessly. Should generate "unable to rename temporary" errors at some point...

If a buildkernel/buildworld run fails, the log file will be copied into the user's home directory...
Comment 2 Nils Beyer 2017-07-26 17:37:19 UTC
(In reply to Don Lewis from the other ticket)

> Too bad you're not running ECC RAM, that would eliminate one potential silent cause of strange behavior.

I do have ECC RAM in my second Ryzen system - the one with 16GB RAM, which is too little for poudriere builds. And there I get these "unable to rename temporary" errors under buildkernel/buildworld stress (using TMPFS), too.

For completeness, I'm now running my buildkernel/buildworld stress test on that system without TMPFS. I'll let you know how it goes; I expect there will be no such errors...
Comment 3 Nils Beyer 2017-07-26 17:44:13 UTC
Created attachment 184745 [details]
logs of failed poudriere builds

attached log files of some failed poudriere builds
Comment 4 Don Lewis freebsd_committer 2017-07-26 17:52:28 UTC
I sometimes see processes spuriously returning a non-zero exit status:

gmake[4]: *** [Makefile:827: gmime-object.lo] Error 1
gmake[4]: *** Waiting for unfinished jobs....

The compiler did not report any errors.  No sign of the compiler core dumping in /var/log/messages at the time of that error.

This issue is rare, but I've seen it a few times before.
Comment 5 Nils Beyer 2017-07-26 18:49:11 UTC
Before I forget, here's my "poudriere.conf":
---------------------------------------------------------------------------------
ZPOOL=asbach
FREEBSD_HOST=ftp://ftp2.de.freebsd.org
RESOLV_CONF=/etc/resolv.conf
BASEFS=/usr/local/poudriere
USE_PORTLINT=no
USE_TMPFS=yes
DISTFILES_CACHE=/usr/ports/distfiles
PARALLEL_JOBS=15
BUILD_AS_NON_ROOT=no
ALLOW_MAKE_JOBS_PACKAGES="pkg ccache py* gcc* llvm* ghc* *webkit* *office* chromium* iridium*"
---------------------------------------------------------------------------------
Comment 6 Nils Beyer 2017-07-26 18:50:45 UTC
(In reply to Don Lewis from comment #4)

by "seen before", do you mean before you got your Ryzen CPU, or after?
Comment 7 Nils Beyer 2017-07-26 18:55:02 UTC
FWIW, I also get MCA messages from the kernel, like this one:
---------------------------------------------------------------------------------
MCA: Bank 1, Status 0xd0200000000b0151
MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 4
MCA: CPU 4 COR OVER ICACHE L1 IRD error

/var/log/messages.2.bz2:Jul 23 22:42:14 asbach kernel: MCA: Bank 1, Status 0x90200000000b0151
/var/log/messages.2.bz2:Jul 23 22:42:14 asbach kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
/var/log/messages.2.bz2:Jul 23 22:42:14 asbach kernel: MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 13
/var/log/messages.2.bz2:Jul 23 22:42:14 asbach kernel: MCA: CPU 13 COR ICACHE L1 IRD error

/var/log/messages.16.bz2:Jul  5 13:28:55 asbach kernel: MCA: Bank 1, Status 0x90200000000b0151
/var/log/messages.16.bz2:Jul  5 13:28:55 asbach kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
/var/log/messages.16.bz2:Jul  5 13:28:55 asbach kernel: MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 14
/var/log/messages.16.bz2:Jul  5 13:28:55 asbach kernel: MCA: CPU 14 COR ICACHE L1 IRD error

/var/log/messages.16.bz2:Jul  5 19:28:55 asbach kernel: MCA: Bank 1, Status 0x90200000000b0151
/var/log/messages.16.bz2:Jul  5 19:28:55 asbach kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
/var/log/messages.16.bz2:Jul  5 19:28:55 asbach kernel: MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 10
/var/log/messages.16.bz2:Jul  5 19:28:55 asbach kernel: MCA: CPU 10 COR ICACHE L1 IRD error

---------------------------------------------------------------------------------

The CPU number is always different; the rest stays the same. I get these on another Ryzen system, too. The latter one has ECC RAM and was assembled three months after my first system...
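The repeating status words can be sanity-checked against the architectural high bits of MCi_STATUS. The helper below is a hypothetical sketch using the standard x86 MCA bit layout (VAL/OVER/UC/EN), not the kernel's mca(4) code; the model-specific low bits are left undecoded.

```python
# Hypothetical decoder for the architectural high bits of the MCi_STATUS
# words logged above (standard x86 MCA layout).  Only VAL/OVER/UC/EN are
# decoded; the model-specific fields in the low bits are ignored.
def decode_mca_status(status):
    names = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN"}
    return [name for bit, name in sorted(names.items(), reverse=True)
            if status >> bit & 1]

print(decode_mca_status(0x90200000000b0151))  # ['VAL', 'EN']
print(decode_mca_status(0xd0200000000b0151))  # ['VAL', 'OVER', 'EN'] -- the "COR OVER" entry
```

UC (uncorrected) is clear in both words, which is why the kernel prints them as COR, i.e. hardware-corrected errors.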
Comment 8 Nils Beyer 2017-07-26 19:04:50 UTC
Here are the MCA messages from my other Ryzen system (the one with ECC RAM):
------------------------------------------------------------------------------------------------
/var/log/messages.17.bz2:Jul  8 04:52:05 capetown kernel: MCA: Bank 1, Status 0x90200000000b0151
/var/log/messages.17.bz2:Jul  8 04:52:05 capetown kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
/var/log/messages.17.bz2:Jul  8 04:52:05 capetown kernel: MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 10
/var/log/messages.17.bz2:Jul  8 04:52:05 capetown kernel: MCA: CPU 10 COR ICACHE L1 IRD error

/var/log/messages.20.bz2:Jul  5 13:45:31 capetown kernel: MCA: Bank 1, Status 0x90200000000b0151
/var/log/messages.20.bz2:Jul  5 13:45:31 capetown kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
/var/log/messages.20.bz2:Jul  5 13:45:31 capetown kernel: MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 10
/var/log/messages.20.bz2:Jul  5 13:45:31 capetown kernel: MCA: CPU 10 COR ICACHE L1 IRD error
------------------------------------------------------------------------------------------------

these occurred during buildkernel/buildworld compilation stress tests. I'll try to provoke some more in order to get another core number...
Comment 9 Nils Beyer 2017-07-26 19:26:25 UTC
(In reply to Nils Beyer from comment #2)

believe it or not, I also get these
--------------------------------------------------------------------------------
error: unable to rename temporary 'OptParserEmitter.o-a86dcd92' to output file 'OptParserEmitter.o': 'No such file or directory'
--------------------------------------------------------------------------------

while _not_ using TMPFS; here's the script I'm running:
--------------------------------------------------------------------------------
#!/bin/sh

OBJDIR="/tmp/ryzen_stress_test"

trap "exit 1" 1 2 3

cd /usr/src
mkdir ${OBJDIR}

while [ 1 ]; do
        echo "`date` begin"
        BEG="`date +%s`"
#       umount ${OBJDIR} ; umount ${OBJDIR} ; umount ${OBJDIR}
#       mount -t tmpfs tmpfs ${OBJDIR} || exit 1
        rm -rf ${OBJDIR}
        mkdir ${OBJDIR}
        export MAKEOBJDIRPREFIX=${OBJDIR}
        make -j20 buildworld buildkernel >${OBJDIR}/${BEG}.log 2>&1
        ERR="$?"
        echo "`date` end - errorcode ${ERR}"
        [ "${ERR}" != "0" ] && cp ${OBJDIR}/${BEG}.log ~/.
        rm ${OBJDIR}/${BEG}.log
done
--------------------------------------------------------------------------------
Comment 10 Don Lewis freebsd_committer 2017-07-26 20:24:19 UTC
(In reply to Nils Beyer from comment #6)
Only on the Ryzen, maybe a week ago, before moving the shared page.  I've never seen it on my AMD FX-8320E.
Comment 11 Don Lewis freebsd_committer 2017-07-26 20:29:13 UTC
(In reply to Nils Beyer from comment #7)
I've gotten one of these:
Jul 22 17:17:17 speedy kernel: MCA: Bank 1, Status 0x90200000000b0151
Jul 22 17:17:17 speedy kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
Jul 22 17:17:17 speedy kernel: MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 14
Jul 22 17:17:17 speedy kernel: MCA: CPU 14 COR ICACHE L1 IRD error

The hardware corrected the error.  These should be benign.
Comment 12 Don Lewis freebsd_committer 2017-07-26 20:33:42 UTC
(In reply to Nils Beyer from comment #9)
I'm pretty sure I saw one while doing a buildworld on ZFS as well.  I think these errors occur less frequently on ZFS.  I just did two poudriere runs with tmpfs disabled and didn't see this error.

My first suspicion is that this could be a race condition in our code exposed by more parallelism.

Which version of the share page patch are you running?  Earlier you mentioned not seeing this on the machine using the original version.
Comment 13 Nils Beyer 2017-07-26 20:57:13 UTC
(In reply to Don Lewis from comment #12)

> I'm pretty sure I saw one while doing a buildworld on ZFS as well.  I think these errors occur less frequently on ZFS.  I just did two poudriere runs with tmpfs disabled and didn't see this error.

try my buildkernel/buildworld "ryzen_stress_test.sh" script - let it run for 24h. Execute with:

    /usr/bin/nohup sh ryzen_stress_test.sh &

and hope for a "nohup.out" file like this:
-----------------------------------------------------------------------------
mkdir: /tmp/ryzen_stress_test: File exists
Wed Jul 26 19:23:09 CEST 2017 begin
Wed Jul 26 19:45:04 CEST 2017 end - errorcode 0
Wed Jul 26 19:45:04 CEST 2017 begin
Wed Jul 26 20:07:06 CEST 2017 end - errorcode 0
Wed Jul 26 20:07:06 CEST 2017 begin
Wed Jul 26 20:29:09 CEST 2017 end - errorcode 0
Wed Jul 26 20:29:09 CEST 2017 begin
Wed Jul 26 20:44:52 CEST 2017 end - errorcode 2
Wed Jul 26 20:44:52 CEST 2017 begin
Wed Jul 26 21:06:52 CEST 2017 end - errorcode 0
Wed Jul 26 21:06:52 CEST 2017 begin
Wed Jul 26 21:28:55 CEST 2017 end - errorcode 0
Wed Jul 26 21:28:55 CEST 2017 begin
Wed Jul 26 21:50:57 CEST 2017 end - errorcode 0
Wed Jul 26 21:50:57 CEST 2017 begin
Wed Jul 26 22:13:00 CEST 2017 end - errorcode 0
Wed Jul 26 22:13:00 CEST 2017 begin
Wed Jul 26 22:35:00 CEST 2017 end - errorcode 0
Wed Jul 26 22:35:00 CEST 2017 begin
-----------------------------------------------------------------------------


> My first suspicion is that this could be race condition in our code exposed by more parallelism.

I don't think so, because this happens in poudriere builds, too. Those builds are mainly single-threaded builds - "kf5-kservice-5.36.0", for instance, generated that error even though it is single-threaded. And for buildkernel/buildworld, this does not happen on my Intel system with the same number of threads (20).


> Which version of the share page patch are you running?

this one:
-------------------------------------------------------------------------------
Index: sys/amd64/include/vmparam.h
===================================================================
--- sys/amd64/include/vmparam.h (revision 321399)
+++ sys/amd64/include/vmparam.h (working copy)
@@ -176,7 +176,7 @@
 
 #define        VM_MAXUSER_ADDRESS      UVADDR(NUPML4E, 0, 0, 0)
 
-#define        SHAREDPAGE              (VM_MAXUSER_ADDRESS - PAGE_SIZE)
+#define        SHAREDPAGE              (VM_MAXUSER_ADDRESS - 2*PAGE_SIZE)
 #define        USRSTACK                SHAREDPAGE
 
 #define        VM_MAX_ADDRESS          UPT_MAX_ADDRESS
-------------------------------------------------------------------------------
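As a back-of-the-envelope check of what the one-liner changes: with the amd64 user address-space limit taken as 0x800000000000 (an assumption based on NUPML4E = 256; check your vmparam.h) and 4 KiB pages, the patch moves the shared page, and with it USRSTACK, one page lower:

```python
# Back-of-the-envelope check of the one-line SHAREDPAGE patch.
# Assumptions: VM_MAXUSER_ADDRESS = 0x800000000000 and 4 KiB pages.
PAGE_SIZE = 4096
VM_MAXUSER_ADDRESS = 0x800000000000

sharedpage_old = VM_MAXUSER_ADDRESS - PAGE_SIZE       # original placement
sharedpage_new = VM_MAXUSER_ADDRESS - 2 * PAGE_SIZE   # patched placement

print(hex(sharedpage_old))  # 0x7ffffffff000 -- last page below the user VA limit
print(hex(sharedpage_new))  # 0x7fffffffe000 -- one page lower
```

The idea behind the workaround was to move the shared page's code away from the very top of the canonical user address space.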


> Earlier you mentioned not seeing this on the machine using the original version.

I think you mean this comment here:

    https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399#c127

I haven't seen them yet at that time - but they appeared in a following poudriere session...
Comment 14 Nils Beyer 2017-07-26 21:12:08 UTC
(In reply to Don Lewis from comment #12)

> Earlier you mentioned not seeing this on the machine using the original version.

Ah, okay, sorry; you surely meant this comment here:

    https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399#c95

yes, no "unable to rename" crap there while using your full patch; but these were "only" 21 passes. Shortly after that I removed your mega-patch - because someone said it is evil - and applied the one-liner version of it. So, no definite answer whether your evil patch bypasses the UTR-problem (Unable-To-Rename) or not...
Comment 15 Nils Beyer 2017-07-26 21:15:22 UTC
(In reply to Nils Beyer from comment #14)

> So, no definite answer whether your evil patch bypasses the UTR-problem (Unable-To-Rename) or not...

and here I give myself the definite answer:

    https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399#c109

*sigh*

that ticket of mine has become too long...
Comment 16 Nils Beyer 2017-07-27 18:58:44 UTC
Finally, my poudriere run went completely through - without any system freezes or unexpected reboots:

    http://46.245.217.106:10080/build.html?mastername=11_1-default&build=2017-07-26_01h13m51s

got an MCA though:
-------------------------------------------------------------------------------
Jul 27 10:08:38 asbach kernel: pid 64716 (conftest), uid 0: exited on signal 10
Jul 27 10:09:41 asbach kernel: MCA: Bank 1, Status 0x90200000000b0151
Jul 27 10:09:41 asbach kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
Jul 27 10:09:41 asbach kernel: MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 9
Jul 27 10:09:41 asbach kernel: MCA: CPU 9 COR ICACHE L1 IRD error
Jul 27 10:28:53 asbach kernel: pid 67290 (doxygen), uid 0: exited on signal 6
-------------------------------------------------------------------------------

and "chromium" failed because of a strange SIGTRAP and
-------------------------------------------------------------------------------
[0726/093727.533851:FATAL:ref_counted.cc(26)] Check failed: in_dtor_. RefCountedThreadSafe object deleted without calling Release()
*** Signal 5
-------------------------------------------------------------------------------

I'm now running another poudriere incremental build with the same ports tree and jail to check which ports will build successfully now...
Comment 17 Don Lewis freebsd_committer 2017-07-27 20:12:30 UTC
A buildworld with objdir on tmpfs and src on zfs failed for me in this new way:

===> usr.bin/xstr (cleandir)
--- cleandir_subdir_usr.bin/readelf ---
--- cleanobj ---
--- cleandir_subdir_usr.bin/nm ---
--- cleanobj ---
--- cleandir_subdir_usr.bin/vi ---
===> usr.bin/vi (cleandir)
--- cleandir_subdir_usr.bin/lex ---
--- cleanobj ---
--- cleandir_subdir_usr.bin/lex/lib ---
===> usr.bin/lex/lib (cleandir)
--- cleandir_subdir_usr.bin/yacc ---
===> usr.bin/yacc (cleandir)
--- cleandir_subdir_usr.bin/vtfontcvt ---
===> usr.bin/vtfontcvt (cleandir)
--- cleandir_subdir_usr.bin/usbhidaction ---
===> usr.bin/usbhidaction (cleandir)
--- cleandir_subdir_usr.bin/lex ---
make[5]: Cannot open `.' (No such file or directory)
*** [cleandir_subdir_usr.bin/lex/lib] Error code 1
Comment 18 Nils Beyer 2017-07-27 20:43:25 UTC
(In reply to Don Lewis from comment #17)

did that happen by accident or did you try my "ryzen_stress_test.sh" script?
Comment 19 Mateusz Guzik freebsd_committer 2017-07-27 21:13:11 UTC
can you reproduce with this crude debug patch in?

https://people.freebsd.org/~mjg/cache-debug.diff

if anything goes wrong it will print to dmesg lines starting with cache_enter_time

there is some fishy code which may or may not be buggy in a way which gives this problem. if you run into the failure without the message showing up, i don't know what this can be
Comment 20 Mateusz Guzik freebsd_committer 2017-07-27 21:29:28 UTC
to be clear, there are 2 parts:
1. possible problem where an already present negative entry causes dropping the new positive entry on the floor, which gives ENOENT on lookup
2. dot lookup can only fail in the namecache if the vnode is VI_DOOMED, which normally should not happen if you don't try to force it. what could be wrong if this is the problem i don't know yet. can't reproduce myself.
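Failure mode 1 can be illustrated with a toy model (pure illustration, not the actual sys/kern/vfs_cache.c logic): a stale negative entry survives and shadows the positive entry for a file that was just created, so a later lookup reports ENOENT.

```python
# Toy namecache illustrating failure mode 1 above (not FreeBSD's code).
# A negative entry (None) records "name does not exist"; the buggy insert
# drops the new positive entry on the floor when any entry already exists.
class NameCache:
    def __init__(self):
        self.entries = {}             # name -> vnode, or None for negative

    def lookup(self, name):
        return self.entries.get(name, "MISS")

    def enter_negative(self, name):
        self.entries[name] = None

    def enter_positive_buggy(self, name, vnode):
        if name not in self.entries:  # bug: a stale negative entry wins
            self.entries[name] = vnode

cache = NameCache()
cache.enter_negative("OptParserEmitter.o")               # lookup before the file exists
cache.enter_positive_buggy("OptParserEmitter.o", "vp1")  # file created, entry dropped
print(cache.lookup("OptParserEmitter.o"))                # None -> lookup reports ENOENT
```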
Comment 21 Nils Beyer 2017-07-27 21:33:24 UTC
(In reply to Mateusz Guzik from comment #20)

> [...] can't reproduce myself.

do you have a Ryzen system?
Comment 22 Mateusz Guzik freebsd_committer 2017-07-27 21:35:33 UTC
no, but there are legitimate reasons to suspect this is an actual bug in the os
Comment 23 Don Lewis freebsd_committer 2017-07-27 22:15:38 UTC
(In reply to Nils Beyer from comment #18)
I'm running a variation of the stress test.  I just got another, similar, failure.


--- cleandir_subdir_usr.bin/su ---
===> usr.bin/su (cleandir)
--- cleandir_subdir_usr.bin/split ---
--- cleanobj ---
--- cleandir_subdir_usr.sbin ---
--- cleandir_subdir_usr.sbin/bsdconfig ---
--- cleandir_subdir_usr.sbin/bsdconfig/includes ---
--- cleandepend ---
--- cleandir_subdir_usr.sbin/bsdconfig/networking ---
===> usr.sbin/bsdconfig/networking (cleandir)
--- cleandir_subdir_lib ---
make[5]: Cannot open `.' (No such file or directory)
--- cleandir_subdir_usr.sbin ---
Comment 24 Nils Beyer 2017-07-27 22:23:53 UTC
(In reply to Don Lewis from comment #23)

> [...] I'm running a variation of the stress test.

now, that sounds daring - want to share?
Comment 25 Don Lewis freebsd_committer 2017-07-27 22:28:17 UTC
Nothing really magical. Just very stripped down, and it bails out if a build fails so that the state of the obj tree can be examined. You have to cd to the top of the src tree yourself, and /mnt/x is tmpfs.

#!/bin/sh
a=1
while MAKEOBJDIRPREFIX=/mnt/x make -j18 buildworld buildkernel  > /tmp/buildworld.out 2>&1; do
echo $a
a=`expr $a + 1`
done
Comment 26 Nils Beyer 2017-07-27 22:57:02 UTC
(In reply to Don Lewis from comment #25)

http://netdna.webdesignerdepot.com/uploads/2013/05/11.jpg
Comment 27 Don Lewis freebsd_committer 2017-07-27 23:06:49 UTC
(In reply to Mateusz Guzik from comment #19)
--- test_04.cleandir ---
(cd /usr/src/lib/libxo/tests &&  DEPENDFILE=.depend.test_04  NO_SUBDIR=1 make -f
 Makefile _RECURSING_PROGS=t  PROG=test_04  cleandir)
--- cleandir_subdir_usr.sbin ---
--- cleandir_subdir_usr.sbin/ipfwpcap ---
===> usr.sbin/ipfwpcap (cleandir)
--- cleandir_subdir_usr.bin ---
--- cleandir_subdir_usr.bin/tar ---
make[5]: Cannot open `.' (No such file or directory)
--- cleandir_subdir_lib ---

nothing in dmesg ...
Comment 28 Nils Beyer 2017-07-27 23:10:04 UTC
(In reply to Nils Beyer from comment #16)

as expected, several ports now built successfully (whereas they didn't in the previous run):

    http://46.245.217.106:10080/build.html?mastername=11_1-default&build=2017-07-27_20h34m59s

Diff'ing the two error log directories:
-------------------------------------------------------------------------------
#diff -qr /usr/local/poudriere/data/logs/bulk/11_1-default/2017-07-2*/logs/errors | grep "Only in"
Only in /usr/local/poudriere/data/logs/bulk/11_1-default/2017-07-26_01h13m51s/logs/errors: blackbox_exporter-0.7.0.log
Only in /usr/local/poudriere/data/logs/bulk/11_1-default/2017-07-26_01h13m51s/logs/errors: chromium-59.0.3071.115_2.log
Only in /usr/local/poudriere/data/logs/bulk/11_1-default/2017-07-26_01h13m51s/logs/errors: docker-17.06.0.log
Only in /usr/local/poudriere/data/logs/bulk/11_1-default/2017-07-26_01h13m51s/logs/errors: docker-freebsd-20150625_1.log
Only in /usr/local/poudriere/data/logs/bulk/11_1-default/2017-07-26_01h13m51s/logs/errors: grafana4-4.3.2_1.log
Only in /usr/local/poudriere/data/logs/bulk/11_1-default/2017-07-26_01h13m51s/logs/errors: jogl-1.1.1_6.log
Only in /usr/local/poudriere/data/logs/bulk/11_1-default/2017-07-26_01h13m51s/logs/errors: jython-2.7.0.log
Only in /usr/local/poudriere/data/logs/bulk/11_1-default/2017-07-26_01h13m51s/logs/errors: kf5-kservice-5.36.0.log
Only in /usr/local/poudriere/data/logs/bulk/11_1-default/2017-07-26_01h13m51s/logs/errors: libbfd-2.19.1_2.log
Only in /usr/local/poudriere/data/logs/bulk/11_1-default/2017-07-26_01h13m51s/logs/errors: packer-1.0.3.log
Only in /usr/local/poudriere/data/logs/bulk/11_1-default/2017-07-27_20h34m59s/logs/errors: scilab-5.5.2_8.log
Only in /usr/local/poudriere/data/logs/bulk/11_1-default/2017-07-26_01h13m51s/logs/errors: webkit2-gtk3-2.16.6.log
Only in /usr/local/poudriere/data/logs/bulk/11_1-default/2017-07-26_01h13m51s/logs/errors: zh_cn-freebsd-doc-50402,1.log
-------------------------------------------------------------------------------

the previous errors were:

- fatal error: MSpan_Sweep: bad span state

- [0726/093727.533851:FATAL:ref_counted.cc(26)] Check failed: in_dtor_. RefCountedThreadSafe object deleted without calling Release()

- fatal error: freeIndex is not valid

- Killing runaway build after 7200 seconds with no output

- failed MSpanList_InsertBack 0x800aa4688 0x800b448d8 0x0 0x0
fatal error: MSpanList_InsertBack

- pkg-static: Unable to access file [...]

- Killing runaway build after 7200 seconds with no output

- error: unable to rename temporary (my favorite)

- *** Error code 1 (no specific message)

- failed MSpanList_InsertBack 0x800b47fb8 0x800b47930 0x0github.com/hashicorp/packer/vendor/github.com/kr/fs
 0x0
fatal error: MSpanList_InsertBack

- error: unable to rename temporary (I like)

- # A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0xa) at pc=0x0000000801d0d233, pid=31769, tid=0x0000000000018dc9


Very, very mysterious - as these messages are gone now.


Tomorrow, I'll do as AMD support suggests and clear my CMOS by removing all cables (power, USB and LAN) as well as the CMOS battery, wait 15 minutes, plug everything back in, and set VCORE to 1.365V - and then I'll do a complete poudriere run (with a cleared packages repository) again...
Comment 29 Nils Beyer 2017-07-28 06:35:33 UTC
(In reply to Mateusz Guzik from comment #19)

I applied your debug patch on my buildkernel/buildworld system and started the stress test. It failed five times out of 28 passes:
-------------------------------------------------------------------------------
1501195693.log:error: unable to rename temporary 'Plugins/Language/ObjC/NSArray.o-80b58f5c' to output file 'Plugins/Language/ObjC/NSArray.o': 'No such file or directory'
1501197724.log:error: unable to rename temporary 'Frontend/ModuleDependencyCollector.o-d70d921a' to output file 'Frontend/ModuleDependencyCollector.o': 'No such file or directory'
1501200864.log:error: unable to rename temporary 'Transforms/Scalar/PartiallyInlineLibCalls.o-ec3a6d48' to output file 'Transforms/Scalar/PartiallyInlineLibCalls.o': 'No such file or directory'
1501203868.log:error: unable to rename temporary 'CodeGen/AsmPrinter/DwarfCFIException.o-9e2901e7' to output file 'CodeGen/AsmPrinter/DwarfCFIException.o': 'No such file or directory'
1501207952.log:error: unable to rename temporary 'StaticAnalyzer/Checkers/AnalyzerStatsChecker.o-d8b93824' to output file 'StaticAnalyzer/Checkers/AnalyzerStatsChecker.o': 'No such file or directory'
-------------------------------------------------------------------------------

but no messages in dmesg, unfortunately...
Comment 30 Nils Beyer 2017-07-28 07:29:30 UTC
Time for the action which AMD tech support requested; I've just answered them:
-----------------------------------------------------------------------------
Hi ...,

1) regarding system freezes and unexpected reboots
==================================================
it seems that the workaround patch by the FreeBSD developer Don Lewis fixes
the system freezes and unexpected reboots here on my system.

Please take a look here:

    https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399#c89
    https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399#c177


2) regarding segmentation faults and other unexpected failures during compilations
==================================================================================
the unexpected compilation failures are still there - despite the above-mentioned
workaround patch. The strange thing is that in a second run these failed compilation
attempts magically succeed, but some others fail instead.

Please take a look here:

    https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221029#c28

So, I followed your request: powered down my system; removed all cables (power,
LAN and USB); removed my CMOS battery; let it sit for ten minutes; reattached
everything; powered on my system; went into the BIOS and set the CPU voltage
statically to 1.3625V (see attachment).

Now, my system is up and I'll be running my compilation tests once more.


3) MCA messages in kernel
=========================
during my compilation tests I got several MCA messages in kernel log (on both of my Ryzen systems):
-------------------------------------------------------------------------------------------------------------------
/var/log/messages:Jul 28 00:09:41 asbach kernel: MCA: Bank 1, Status 0x90200000000b0151
/var/log/messages:Jul 28 00:09:41 asbach kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
/var/log/messages:Jul 28 00:09:41 asbach kernel: MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 9
/var/log/messages:Jul 28 00:09:41 asbach kernel: MCA: CPU 9 COR ICACHE L1 IRD error
/var/log/messages.0.bz2:Jul 27 10:09:41 asbach kernel: MCA: Bank 1, Status 0x90200000000b0151
/var/log/messages.0.bz2:Jul 27 10:09:41 asbach kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
/var/log/messages.0.bz2:Jul 27 10:09:41 asbach kernel: MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 9
/var/log/messages.0.bz2:Jul 27 10:09:41 asbach kernel: MCA: CPU 9 COR ICACHE L1 IRD error
/var/log/messages.1.bz2:Jul 26 05:09:41 asbach kernel: MCA: Bank 1, Status 0xd0200000000b0151
/var/log/messages.1.bz2:Jul 26 05:09:41 asbach kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
/var/log/messages.1.bz2:Jul 26 05:09:41 asbach kernel: MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 4
/var/log/messages.1.bz2:Jul 26 05:09:41 asbach kernel: MCA: CPU 4 COR OVER ICACHE L1 IRD error
/var/log/messages.18.bz2:Jul  5 13:28:55 asbach kernel: MCA: Bank 1, Status 0x90200000000b0151
/var/log/messages.18.bz2:Jul  5 13:28:55 asbach kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
/var/log/messages.18.bz2:Jul  5 13:28:55 asbach kernel: MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 14
/var/log/messages.18.bz2:Jul  5 13:28:55 asbach kernel: MCA: CPU 14 COR ICACHE L1 IRD error
/var/log/messages.18.bz2:Jul  5 19:28:55 asbach kernel: MCA: Bank 1, Status 0x90200000000b0151
/var/log/messages.18.bz2:Jul  5 19:28:55 asbach kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
/var/log/messages.18.bz2:Jul  5 19:28:55 asbach kernel: MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 10
/var/log/messages.18.bz2:Jul  5 19:28:55 asbach kernel: MCA: CPU 10 COR ICACHE L1 IRD error
/var/log/messages.4.bz2:Jul 23 22:42:14 asbach kernel: MCA: Bank 1, Status 0x90200000000b0151
/var/log/messages.4.bz2:Jul 23 22:42:14 asbach kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
/var/log/messages.4.bz2:Jul 23 22:42:14 asbach kernel: MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 13
/var/log/messages.4.bz2:Jul 23 22:42:14 asbach kernel: MCA: CPU 13 COR ICACHE L1 IRD error

/var/log/messages.19.bz2:Jul  8 04:52:05 capetown kernel: MCA: Bank 1, Status 0x90200000000b0151
/var/log/messages.19.bz2:Jul  8 04:52:05 capetown kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
/var/log/messages.19.bz2:Jul  8 04:52:05 capetown kernel: MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 10
/var/log/messages.19.bz2:Jul  8 04:52:05 capetown kernel: MCA: CPU 10 COR ICACHE L1 IRD error
/var/log/messages.22.bz2:Jul  5 13:45:31 capetown kernel: MCA: Bank 1, Status 0x90200000000b0151
/var/log/messages.22.bz2:Jul  5 13:45:31 capetown kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
/var/log/messages.22.bz2:Jul  5 13:45:31 capetown kernel: MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 10
/var/log/messages.22.bz2:Jul  5 13:45:31 capetown kernel: MCA: CPU 10 COR ICACHE L1 IRD error
-------------------------------------------------------------------------------------------------------------------

I think that's worth mentioning...



Kind regards,
-----------------------------------------------------------------------------
Comment 31 Don Lewis freebsd_committer 2017-07-28 08:07:55 UTC
(In reply to Don Lewis from comment #27)
It even fails with:
# sysctl debug.vfscache=0
debug.vfscache: 1 -> 0

Time to add some more debug ...

I'm not seeing any MCA messages.
Comment 32 Nils Beyer 2017-07-28 11:23:16 UTC
(In reply to Nils Beyer from comment #30)

well, after statically setting my VCORE voltage to 1.36250V, my system panicked (that's something new, for sure).

But because of the panic, please continue the discussion here:

    https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399#c182

I've raised the voltage as suggested by AMD tech support and will try again...
Comment 33 Don Lewis freebsd_committer 2017-07-28 18:12:08 UTC
I haven't gotten another "Cannot open `.'" error, but I did get the rename error overnight.  Nothing was logged.  Post-mortem checking of the obj tree shows that the temp file was not present.  Unfortunately clang deletes it if rename() fails, so I don't know if it was missing before the rename.  I've commented out that line of code and will retest.
Comment 34 Don Lewis freebsd_committer 2017-07-29 03:36:24 UTC
I'm pretty sure that the "Cannot open `.'" error is not Ryzen-specific.  It's coming from:

        if (node->tn_links < 1)
                return (ENOENT);

in tmpfs_open().

Basically the current directory is getting deleted out from under make.  I think that UFS handles this more gracefully, but NFS and tmpfs don't.  This doesn't show up with ryzen_stress_test because it starts each iteration with a fresh tmpfs OBJDIR, whereas in my testing I reuse the same OBJDIR each time.

I think it is fixed by the build framework change in r321445.
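The deleted-cwd condition is easy to reproduce in miniature. The sketch below relies on generic POSIX behavior (one process plays both make jobs) rather than anything tmpfs-specific; on FreeBSD tmpfs it is the subsequent open of "." that returns ENOENT via the tn_links check quoted above.

```python
import os
import tempfile

# One process plays both roles: the "make job" sitting in a directory,
# and the concurrent cleandir that removes that directory.
d = tempfile.mkdtemp()
os.chdir(d)
os.rmdir(d)                  # cleandir deletes our current directory

try:
    os.getcwd()              # later use of the now-gone cwd fails
    cwd_ok = True
except FileNotFoundError:
    cwd_ok = False

print("cwd still resolvable:", cwd_ok)
os.chdir(tempfile.gettempdir())          # leave the process somewhere sane
```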
Comment 35 Don Lewis freebsd_committer 2017-07-29 17:08:34 UTC
Last night I upgraded to 12.0-CURRENT r321674 to pick up the build framework change, the timehands change, and the libarchive change.  Since then, my machine has successfully completed 26 buildworld/buildkernel iterations on tmpfs.

My instrumentation has picked up what looks like another framework problem where a directory is being deleted while still in use, but it doesn't break the build.  I'm planning on tracking it down at some point and filing a PR.

No sign of the rename problem, though.  I should probably add some more instrumentation besides what I have already added to try to further pinpoint it if and when it occurs again.
Comment 36 Don Lewis freebsd_committer 2017-08-01 00:31:33 UTC
I finally had a chance to do another poudriere run.  I saw three failures:

  * gnucash - guild segmentation fault

  * go - entersyscall inconsistent 0xc4207acea8 [0xc4207ac000,0xc4207ae000]

  * thunderbird - some clang segmentation faults plus:

cc -fno-strict-aliasing -O2 -pipe -fstack-protector -fno-strict-aliasing -DNDEBUG -D_GLIBCXX_USE_C99 -D_GLIBCXX_USE_C99_MATH_TR1 -D_DECLARE_C99_LDBL_MATH -DLIBICONV_PLUG -isystem /usr/local/include -fPIC -DPSUTIL_VERSION=311 -I/usr/local/include/python2.7 -c psutil/_psutil_bsd.c -o build/temp.freebsd-12.0-CURRENT-amd64-2.7/psutil/_psutil_bsd.o
psutil/_psutil_bsd.c:958:39: error: no member named 'xt_tp' in 'struct xtcpcb'
        tp = &((struct xtcpcb *)xig)->xt_tp;
              ~~~~~~~~~~~~~~~~~~~~~~  ^
psutil/_psutil_bsd.c:959:13: warning: incompatible pointer types assigning to 'struct inpcb *' from 'struct xinpcb *' [-Wincompatible-pointer-types]
        inp = &((struct xtcpcb *)xig)->xt_inp;
            ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
psutil/_psutil_bsd.c:960:39: error: no member named 'xt_socket' in 'struct xtcpcb'
        so = &((struct xtcpcb *)xig)->xt_socket;
              ~~~~~~~~~~~~~~~~~~~~~~  ^
psutil/_psutil_bsd.c:969:33: error: incomplete definition of type 'struct inpcb'
                    AF_INET, inp->inp_lport, &inp->inp_laddr,
                             ~~~^
/usr/include/netinet/in_pcb.h:65:22: note: forward declaration of 'struct inpcb'
LIST_HEAD(inpcbhead, inpcb);

[SNIP]

All three ports built successfully when I retried them.

No rename errors, though I had inadvertently disabled tmpfs for this run.

I'm running 12.0-CURRENT #53 r321732M with my pending sharedpage patch.
Comment 37 Don Lewis freebsd_committer 2017-08-01 05:58:45 UTC
Results of another poudriere run, this time using tmpfs.  Fix failed ports:

  * guile2 - guild segfault

  * wkhtmltopdf - c++ abort

  * libreoffice - c++ abort

  * go - go-bootstrap segfault

  * chromium - compiler rename temp file failure
Comment 38 Don Lewis freebsd_committer 2017-08-01 05:59:49 UTC
s/Fix/Five/
Comment 39 Don Lewis freebsd_committer 2017-08-02 00:49:13 UTC
My machine seems to be really stable doing buildworld/buildkernel these days, even using tmpfs for OBJDIR.  No errors when I was running it last night.  I'm running it again today and am up to 18 error-free iterations so far, on what is the hottest day in my office so far this summer.

I don't know why this works while I get random failures when building ports with poudriere.
Comment 40 Don Lewis freebsd_committer 2017-08-02 04:22:08 UTC
After 26 successful buildworld/buildkernel iterations, I got a failed build caused by a clang assertion failure:

Assertion failed: (Idx < getNumArgs() && "Argument index out of range!"), function getArgKind, file /usr/src/contrib/llvm/tools/clang/include/clang/Basic/Diagnostic.h, line 1249.

Nothing got logged by the kernel other than:
Aug  1 21:17:10 speedy kernel: pid 61983 (cc), uid 0: exited on signal 6 (core dumped)


I was really hoping for a rename failure ...
Comment 41 Don Lewis freebsd_committer 2017-08-02 04:29:39 UTC
Just got another when I restarted the test:
--- asn1_HDB_Ext_PKINIT_hash.o ---
Assertion failed: (Idx < getNumArgs() && "Argument index out of range!"), function getArgKind, file /usr/src/contrib/llvm/tools/clang/include/clang/Basic/Diagnostic.h, line 1249.

This might be an ambient temperature problem; the room is pretty warm, though I just opened a window and there is some cooler outside air blowing across the machine.
Comment 42 Nils Beyer 2017-08-02 17:34:10 UTC
(In reply to Don Lewis from comment #41)

Don, do you have the possibility to disable "OPCache Control" or anything that sounds like that in your BIOS?

I was able to build "ghc" for the first time since I've disabled that in BIOS:

http://46.245.217.106:10080/build.html?mastername=11_1-default&build=2017-08-02_13h12m59s

although I also overclocked my CPU and RAM, so I don't know which of the modifications did it.

Successfully compiling stuff currently is not my primary goal; I want to get rid of the kernel panics. But I thought it might be interesting to you.

Maybe it takes a two-part fix to get a stable system: first your user-page shifting patch and second the disabling of the OPCache...
Comment 43 Don Lewis freebsd_committer 2017-08-02 18:11:45 UTC
Nope, my BIOS doesn't have a knob to disable the OPCache.  Interesting about ghc.

I just got another buildworld failure a little while ago after 26 successful iterations.  The same clang assertion failure, but with yet another source file:

--- rc2_cbc.o ---
Assertion failed: (Idx < getNumArgs() && "Argument index out of range!"), function getArgKind, file /usr/src/contrib/llvm/tools/clang/include/clang/Basic/Diagnostic.h, line 1249.

It doesn't seem to be temperature related since the room is much cooler than when I started this run last night.
Comment 44 Nils Beyer 2017-08-02 19:39:29 UTC
(In reply to Don Lewis from comment #43)

> Nope, my BIOS doesn't have a knob to disable the OPCache.

doesn't matter - my successful "ghc" build was just luck because I tried another build of "ghc" by executing "cd /usr/ports/lang/ghc && make" in a parallel console window; and I got that "bus error" again. So the arbitrariness is still going on.

Back to square one, that is (I hope that's the right wording).


> I just got another buildworld failure a little while ago after 26 successful iterations.  The same clang assertion failure, but with yet another source file:

you haven't gotten these clang assertion failures before, right? So, something must have changed on your side...
Comment 45 Don Lewis freebsd_committer 2017-08-03 05:06:15 UTC
I've never seen a successful amd64 ghc build.  i386, no problem.

I don't recall seeing that assert before, but it could have been hiding in the weeds.  It is strangely consistent, though.

One difference between my buildworld/buildkernel results, which were fairly good, and my poudriere runs, which had more fallout, is that the latter have a much larger memory footprint.  The machine I'm testing has 64GB of RAM, with tons free during the buildworld/buildkernel test, even with tmpfs.  On the other hand, when I'm running poudriere with tmpfs, there is a lot of memory pressure and the machine starts using swap.  I'm wondering if page fault exceptions sometimes get mis-reported, or reported somewhat differently than on older CPUs, causing us to misinterpret them as hard SIGSEGV faults.  This fits the Linux symptoms, especially the bash segfaults.  If libtool is being used, then the bash process for each libtool execution will sit and wait for the compiler process that it spawned to finish.  In the meantime, the memory used by bash will be idle and is likely to get paged out.  After the compiler exits and bash resumes, it is likely to experience a number of page faults.

Now that the system hang/crash problem is fixed for me, which accounted for many of the earlier failures that I saw, and which was not sensitive to SMT or clock speed, I'm going to go back to some underclocking tests with SMT disabled to see if that has any effect when building ports.
Comment 46 Conrad Meyer freebsd_committer 2017-08-03 14:20:12 UTC
(In reply to Don Lewis from comment #45)
Speaking of user faults, have you tried testing after r321919?
Comment 47 Nils Beyer 2017-08-03 17:17:24 UTC
Is that here:

    http://fujii.github.io/2017/06/23/how-to-reproduce-the-segmentation-faluts-on-ryzen/

somewhat relevant, helpful, enlightening? How can the RIP CPU register slip?
Comment 48 Conrad Meyer freebsd_committer 2017-08-03 18:35:41 UTC
That link is interesting.  It looks like perhaps speculative execution is triggering real faults.

https://en.wikipedia.org/wiki/Speculative_execution

(Obviously, it shouldn't cause faults.)
Comment 49 Don Lewis freebsd_committer 2017-08-03 18:41:10 UTC
(In reply to Conrad Meyer from comment #46)
I have not tried r321919 yet.  That'll be my next experiment.  I'm currently running r321732.

I just got the results from my last poudriere experiment with SMT off and the CPU and RAM underclocked.  The only two unexpected failures were a guile SIGSEGV during the lang/guile2 build and the usual ghc SIGBUS, which has always happened for me on Ryzen.  When I restarted poudriere, guile2 and everything that depends on it built.
Comment 50 Don Lewis freebsd_committer 2017-08-03 18:56:17 UTC
(In reply to Conrad Meyer from comment #48)
Interesting link.  I've been hearing rumors about this, but this is the first detailed info that I've seen.

Speculative execution is one of the things that came to mind with the signal trampoline issue.  In this case, I would think that speculative execution would not get past the ret instruction.
Comment 51 Don Lewis freebsd_committer 2017-08-04 16:04:48 UTC
(In reply to Conrad Meyer from comment #46)
r321919 looks very promising.  I upgraded to r322026 and the only "unexpected" failure during my poudriere run was the lang/ghc SIGBUS failure, which has always happened for me on Ryzen.  It doesn't fix the hang/crash problem caused by executing code just under the top of user memory.

I'll re-enable SMT for my next poudriere run to see if my luck holds.
Comment 52 Don Lewis freebsd_committer 2017-08-05 08:50:26 UTC
Testing poudriere with SMT on, I got some more fallout:
  
  * The usual ghc SIGBUS failure

  * clang aborted while building cmake.  Unfortunately the build output
    isn't very verbose, and it looks like the assert message got sent to
    /dev/null, so I don't know if it is the same assert failure as I observed
    in the buildworld/buildkernel tests.

The cmake failure caused *many* ports to be skipped.  When I restarted poudriere, most stuff built successfully, but chromium failed due to the rename problem:

[273/350] CXX tools/gn/bundle_data.o
FAILED: tools/gn/bundle_data.o
clang++40 -MMD -MF tools/gn/bundle_data.o.d  -I/wrkdirs/usr/ports/www/chromium/work/chromium-59.0.3071.115/out_bootstrap/gen -I/wrkdirs/usr/ports/www/chromium/work/chromium-59.0.3071.115 -DNO_TCMALLOC -D__STDC_FORMAT_MACROS -O2 -g0 -D_FILE_OFFSET_BITS=64 -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -pthread -pipe -fno-exceptions -Wno-deprecated-register -Wno-parentheses-equality -std=c++11 -Wno-c++11-narrowing -c /wrkdirs/usr/ports/www/chromium/work/chromium-59.0.3071.115/tools/gn/bundle_data.cc -o tools/gn/bundle_data.o
error: unable to rename temporary 'tools/gn/bundle_data.o-d63e9b2a' to output file 'tools/gn/bundle_data.o': 'No such file or directory'
1 error generated.


No segfaults, though.
Comment 53 Don Lewis freebsd_committer 2017-08-05 18:48:16 UTC
I changed the CPU clock speed from 3.0 GHz to the default 3.4 GHz and things were not quite as rosy.  These were the failures:

 * ghc SIGBUS as usual

 * www/node - clang died with SIGABRT, but I did not find an assert message.

 * lang/go - one of the usual malloc arena consistency problems, this time
   failing with:

    failed MSpanList_InsertBack 0x8008bdb90 0x8008ba4f0 0x0 0x0
    fatal error: MSpanList_InsertBack

 * devel/py-singledispatch failed during extract with SIGBUS.  It looks
   like make was the victim.

 * editors/libreoffice failed due to a clang SIGABRT with no assert messages.

Interestingly both node and libreoffice core files indicate that c++ died at the same place.
Comment 54 Nils Beyer 2017-08-05 19:48:03 UTC
(In reply to Don Lewis from comment #53)

> Interestingly both node and libreoffice core files indicate that c++ died at the same place.

that sounds strange - the main feature of these failures is their arbitrariness, and now you are seeing some kind of repetition.

Do you have any possibility of generating that again - perhaps with a self-written program?

I cannot test this at the moment as both my Ryzen boxes are frozen dead - and they are at my workplace, where I won't be until Monday.

The reason I ask is that the Linux users got some momentum due to two Phoronix articles about these segfaults:

    https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Test-Stress-Run
    https://www.phoronix.com/vr.php?view=25016

Unfortunately, the author relied on segfaults from "conftest" stuff, which was not a very wise choice to prove that there is something wrong. "conftest" is meant to segfault, isn't it?

So, if we can provide something that reliably segfaults on an AMD but not on an Intel, that would be something. It would be a pity if AMD finally says: "see, there is no problem with the Ryzen; you just have done it wrongly all the time"...
Comment 55 Don Lewis freebsd_committer 2017-08-05 21:57:41 UTC
It's not really repeatable at all.  I've seen a couple of different clang aborts, but they are pretty rare, and they happen when building different ports each time.  I did manage to get the proper environment set up so that I could analyze the latest clang core files.  Here's the stack backtrace for the libreoffice failure:

#0  0x0000000002df8efa in kill ()
#1  0x0000000002df8eae in __fail ()
#2  0x0000000002df8e1e in __stack_chk_fail_local ()
#3  0x0000000000a4cf46 in clang::FunctionProtoType::Profile(llvm::FoldingSetNodeID&, clang::ASTContext const&) ()
#4  0x0000000002c33ced in llvm::FoldingSetBase::GrowBucketCount(unsigned int)
    ()
#5  0x0000000002c33e09 in llvm::FoldingSetBase::InsertNode(llvm::FoldingSetBase::Node*, void*) ()
#6  0x0000000000b987ed in clang::ASTContext::getFunctionTypeInternal(clang::QualType, llvm::ArrayRef<clang::QualType>, clang::FunctionProtoType::ExtProtoInfo const&, bool) const ()
#7  0x0000000000b986b0 in clang::ASTContext::getFunctionTypeInternal(clang::QualType, llvm::ArrayRef<clang::QualType>, clang::FunctionProtoType::ExtProtoInfo const&, bool) const ()
#8  0x00000000014a1755 in clang::Sema::BuildFunctionType(clang::QualType, llvm::MutableArrayRef<clang::QualType>, clang::SourceLocation, clang::DeclarationName, clang::FunctionProtoType::ExtProtoInfo const&) ()
#9  0x00000000014f6c2c in clang::Sema::SubstFunctionDeclType(clang::TypeSourceInfo*, clang::MultiLevelTemplateArgumentList const&, clang::SourceLocation, clang::DeclarationName, clang::CXXRecordDecl*, unsigned int) ()
#10 0x00000000014e2bb2 in clang::TemplateDeclInstantiator::SubstFunctionType(clang::FunctionDecl*, llvm::SmallVectorImpl<clang::ParmVarDecl*>&) ()
#11 0x00000000014e02a7 in clang::TemplateDeclInstantiator::VisitCXXMethodDecl(clang::CXXMethodDecl*, clang::TemplateParameterList*, bool) ()
#12 0x00000000014dfc15 in clang::TemplateDeclInstantiator::VisitFunctionTemplateDecl(clang::FunctionTemplateDecl*) ()
#13 0x00000000014fa605 in clang::Sema::InstantiateClass(clang::SourceLocation, clang::CXXRecordDecl*, clang::CXXRecordDecl*, clang::MultiLevelTemplateArgumentList const&, clang::TemplateSpecializationKind, bool) ()
#14 0x00000000014fc81d in clang::Sema::InstantiateClassTemplateSpecialization(clang::SourceLocation, clang::ClassTemplateSpecializationDecl*, clang::TemplateSpecializationKind, bool) ()
#15 0x00000000014b2d39 in clang::Sema::RequireCompleteTypeImpl(clang::SourceLocation, clang::QualType, clang::Sema::TypeDiagnoser*) ()
#16 0x000000000109f55f in clang::Sema::IsDerivedFrom(clang::SourceLocation, clang::QualType, clang::QualType) ()
#17 0x00000000015f7ad5 in clang::Sema::IsPointerConversion(clang::Expr*, clang::QualType, clang::QualType, bool, clang::QualType&, bool&) ()
#18 0x00000000016218cb in IsStandardConversion(clang::Sema&, clang::Expr*, clang::QualType, bool, clang::StandardConversionSequence&, bool, bool) ()
#19 0x00000000015f5b17 in TryImplicitConversion(clang::Sema&, clang::Expr*, clang::QualType, bool, bool, bool, bool, bool, bool) ()
#20 0x0000000001603ed8 in TryCopyInitialization(clang::Sema&, clang::Expr*, clang::QualType, bool, bool, bool, bool) ()
#21 0x00000000016030c2 in clang::Sema::AddOverloadCandidate(clang::FunctionDecl*, clang::DeclAccessPair, llvm::ArrayRef<clang::Expr*>, clang::OverloadCandidateSet&, bool, bool, bool, llvm::MutableArrayRef<clang::ImplicitConversionSequence>) ()
#22 0x0000000001616956 in AddOverloadedCallCandidate(clang::Sema&, clang::DeclAccessPair, clang::TemplateArgumentListInfo*, llvm::ArrayRef<clang::Expr*>, clang::OverloadCandidateSet&, bool, bool) ()
#23 0x00000000016166d9 in clang::Sema::AddOverloadedCallCandidates(clang::UnresolvedLookupExpr*, llvm::ArrayRef<clang::Expr*>, clang::OverloadCandidateSet&, bool) ()
#24 0x0000000001616aba in clang::Sema::buildOverloadedCallSet(clang::Scope*, clang::Expr*, clang::UnresolvedLookupExpr*, llvm::MutableArrayRef<clang::Expr*>, clang::SourceLocation, clang::OverloadCandidateSet*, clang::ActionResult<clang::Expr*, true>*) ()
#25 0x0000000001616e90 in clang::Sema::BuildOverloadedCallExpr(clang::Scope*, clang::Expr*, clang::UnresolvedLookupExpr*, clang::SourceLocation, llvm::MutableArrayRef<clang::Expr*>, clang::SourceLocation, clang::Expr*, bool, bool) ()
#26 0x0000000000fe720b in clang::Sema::ActOnCallExpr(clang::Scope*, clang::Expr*, clang::SourceLocation, llvm::MutableArrayRef<clang::Expr*>, clang::SourceLocation, clang::Expr*, bool) ()
#27 0x00000000015064a8 in clang::TreeTransform<(anonymous namespace)::TemplateInstantiator>::TransformCallExpr(clang::CallExpr*) ()
#28 0x000000000151cb12 in clang::TreeTransform<(anonymous namespace)::TemplateInstantiator>::TransformReturnStmt(clang::ReturnStmt*) ()
#29 0x0000000001515243 in clang::TreeTransform<(anonymous namespace)::TemplateInstantiator>::TransformCompoundStmt(clang::CompoundStmt*, bool) ()
#30 0x00000000014fd764 in clang::Sema::SubstStmt(clang::Stmt*, clang::MultiLevelTemplateArgumentList const&) ()
#31 0x00000000014e9c1c in clang::Sema::InstantiateFunctionDefinition(clang::SourceLocation, clang::FunctionDecl*, bool, bool, bool) ()
#32 0x00000000014ecf2b in clang::Sema::PerformPendingInstantiations(bool) ()
#33 0x0000000001203c38 in clang::Sema::ActOnEndOfTranslationUnit() ()
#34 0x00000000012524a9 in clang::Parser::ParseTopLevelDecl(clang::OpaquePtr<clang::DeclGroupRef>&) ()
#35 0x00000000007dff95 in clang::ParseAST(clang::Sema&, bool, bool) ()
#36 0x00000000007d5bbc in clang::FrontendAction::Execute() ()
#37 0x0000000000d0f5f1 in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) ()
#38 0x000000000040ba0e in clang::ExecuteCompilerInvocation(clang::CompilerInstance*) ()
#39 0x0000000000400943 in cc1_main (Argv=..., 
    Argv0=0x7fffffff5348 "/usr/bin/c++", 
    MainAddr=0x406470 <GetExecutablePath(char const*, bool)>)
    at /var/poudriere/jails/120CURRENTamd64/usr/src/contrib/llvm/tools/clang/tools/driver/cc1_main.cpp:221
#40 0x0000000000409038 in ExecuteCC1Tool (Tool=..., argv=...)
    at /var/poudriere/jails/120CURRENTamd64/usr/src/contrib/llvm/tools/clang/tools/driver/driver.cpp:306
#41 main (argc_=<optimized out>, argv_=<optimized out>)
    at /var/poudriere/jails/120CURRENTamd64/usr/src/contrib/llvm/tools/clang/tools/driver/driver.cpp:387

__stack_chk_fail_local() is part of the stack smash detector.  It appears that the stack check canary is random, which might have something to do with the failures being random ... or we're getting random stack smashes.
Comment 56 Don Lewis freebsd_committer 2017-08-05 22:21:54 UTC
The stack check failure in c++ for the node and the earlier cmake build failures were also in clang::FunctionProtoType::Profile().

The stack doesn't look too damaged, since the caller info looks reasonable.  Either a partial smash or a false alarm.
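For context, here is a toy model (in Python, not the compiler-emitted x86 code) of what the stack smash detector checks: a canary value is stored just past the vulnerable buffer on function entry and verified before return, so an overflow that rewrites it triggers __stack_chk_fail. The buffer/canary layout and sizes below are illustrative only.

```python
import os

# Toy model of the stack protector: a random canary sits just past the
# buffer; an overflowing write clobbers it, and the "epilogue" check fails.
CANARY = os.urandom(8)           # the real canary is randomized per process

def guarded_copy(src: bytes) -> bool:
    """Copy src into a 16-byte 'stack buffer'; return True if the canary
    survived (i.e. no overflow was detected)."""
    frame = bytearray(24)                 # [0:16] buffer, [16:24] canary slot
    frame[16:24] = CANARY                 # "prologue": store the canary
    frame[0:len(src)] = src               # unchecked copy into the buffer
    return bytes(frame[16:24]) == CANARY  # "epilogue": verify before "return"

print(guarded_copy(b"short string"))      # True: fits in the buffer
print(guarded_copy(b"A" * 20))            # False: overflow smashed the canary
```

A false alarm in this scheme would mean the canary slot itself was corrupted (or misread) without the adjacent buffer actually overflowing, which is consistent with the "partial smash or false alarm" reading above.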
Comment 57 Don Lewis freebsd_committer 2017-08-06 20:45:29 UTC
(In reply to Don Lewis from comment #55)
> #3  0x0000000000a4cf46 in  clang::FunctionProtoType::Profile(llvm::FoldingSetNodeID&, clang::ASTContext const&) ()
> #4  0x0000000002c33ced in llvm::FoldingSetBase::GrowBucketCount(unsigned int) ()

Something very strange is going on here: llvm::FoldingSetBase::GrowBucketCount() doesn't call clang::FunctionProtoType::Profile().
Comment 58 Nils Beyer 2017-08-06 20:49:36 UTC
(In reply to Don Lewis from comment #57)

Hmm, "speculative execution" aka "slipped RIP" perhaps?
Comment 59 Don Lewis freebsd_committer 2017-08-07 00:19:01 UTC
(In reply to Nils Beyer from comment #58)
Actually, I'm not so sure now.  If I disassemble llvm::FoldingSetBase::GrowBucketCount(), I see this computed call:

   0x0000000002c33cea <+250>:	callq  *0x18(%rax)
=> 0x0000000002c33ced <+253>:	mov    0x8(%r12),%rcx

so we could be getting to clang::FunctionProtoType::Profile() by way of some c++ magic.
Comment 60 Conrad Meyer freebsd_committer 2017-08-07 00:28:27 UTC
(In reply to Don Lewis from comment #57)
(In reply to Don Lewis from comment #59)

Yeah, that seems to be the case.  From FoldingSet.h:

 39 /// Any node that is to be included in the folding set must be a subclass of
 40 /// FoldingSetNode.  The node class must also define a Profile method used to
 41 /// establish the unique bits of data for the node.  The Profile method is
 42 /// passed a FoldingSetNodeID object which is used to gather the bits.

It's defined on the node type for the template.

FunctionProtoType is defined in tools/clang/include/clang/AST/Type.h; it inherits from FoldingSetNode and indeed has a Profile method.  So that stack isn't unreasonable.
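A toy analogue (in Python, not LLVM's C++) of why that frame sequence is plausible: the folding set's rehash loop invokes each node's Profile method through an indirect dispatch, which corresponds to the computed `callq *0x18(%rax)` in the disassembly, so GrowBucketCount can land in FunctionProtoType::Profile without naming it. The class and method names below merely mimic the LLVM ones.

```python
# Toy analogue of LLVM's FoldingSet: the container's rehash loop invokes
# Profile() indirectly on each node, and the node subclass supplies it.
class FoldingSetNode:
    def profile(self, bits):
        raise NotImplementedError          # subclasses must define Profile

class FunctionProtoNode(FoldingSetNode):   # stands in for FunctionProtoType
    def __init__(self, num_params):
        self.num_params = num_params

    def profile(self, bits):               # "FunctionProtoType::Profile"
        bits.append(self.num_params)

def grow_bucket_count(nodes):
    """Stand-in for FoldingSetBase::GrowBucketCount: rehash every node via
    the indirect profile() call (the computed call in the disassembly)."""
    buckets = {}
    for node in nodes:
        bits = []
        node.profile(bits)                 # indirect dispatch into the subclass
        buckets.setdefault(hash(tuple(bits)), []).append(node)
    return buckets

nodes = [FunctionProtoNode(2), FunctionProtoNode(3)]
buckets = grow_bucket_count(nodes)
print("nodes rehashed into", len(buckets), "buckets")
```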
Comment 61 Nils Beyer 2017-08-07 18:54:40 UTC
well:

    https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Segv-Response

"Performance Marginality Problem" - stupid phrase, really. Marketing drivel.

This here could be more of an explanation:

    https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/967913-amd-confirms-linux-performance-marginality-problem-affecting-some-doesn-t-affect-epyc-tr?p=967927#post967927

But that's something an operating system is not able to solve, is it?
Comment 62 Conrad Meyer freebsd_committer 2017-08-07 19:13:56 UTC
Interesting, thanks for sharing Nils.
Comment 63 Don Lewis freebsd_committer 2017-08-07 22:22:41 UTC
(In reply to Nils Beyer from comment #61)
Hard to say ...

If it is something triggered by context switches and is well understood, it might be possible to add some synchronization code in that path as a workaround.

It might be a slow path through some logic that was missed by AMD's test procedures.  My testing does seem to indicate that there could be a sensitivity to the core clock speed, but it doesn't seem to be a sharp edge, and it also doesn't seem to be temperature sensitive.  Doing a better job of testing should catch the bad chips, at the possible expense of throwing out a lot of them.  It might be possible to work around that in microcode by allowing more time for the signals to propagate.

It could also be some sort of other signal integrity problem, which should also be possible to catch in test, but would be more difficult to work around.

In either of these cases, the best long term fix is a new stepping.
Comment 64 Nils Beyer 2017-08-07 22:42:40 UTC
(In reply to Don Lewis from comment #63)

hmm - that doesn't sound so good; if you say that there's nothing FreeBSD can do to work around/circumvent/mitigate the issue and that new silicon is needed, then I would close this bug as it isn't fixable at the moment.

The following post (just appeared):

    https://community.amd.com/message/2816419#comment-2816419

where a friendly guy tried both the "ryzen_stress_test.sh" buildworld/buildkernel test and your "ryzen_provoke_crash" program from the other bug report, seems to support my assumption: he does not appear to get any "unable to rename" errors in the buildworld/buildkernel test within 14 hours (which could, of course, be too short, but it is promising).

So, if you're okay with it, I'll hereby close this bug...
Comment 65 Don Lewis freebsd_committer 2017-08-07 23:25:30 UTC
I am not convinced that the rename problem isn't a FreeBSD problem.  It's rare enough that it is difficult to track down.
Comment 66 Nils Beyer 2017-08-07 23:35:15 UTC
(In reply to Don Lewis from comment #65)

> I am not convinced that the rename problem isn't a FreeBSD problem.  It's rare enough that it is difficult to track down.

I wasn't able to get that "unable to rename" error on my Xeon E3-1220v3. Granted, it's a 4C/4T CPU that has been hammered with 20 threads. I don't have another non-Ryzen 8C/16T (or higher) CPU at my disposal at the moment, so I cannot confirm or rule out that this issue happens elsewhere (which would indeed hint at a FreeBSD problem).

Maybe someone else can jump in?

Would it be sensible to open a new bug report for that "unable to rename" problem?
Comment 67 Nils Beyer 2017-08-09 22:49:30 UTC
(In reply to Nils Beyer from comment #66)

well, it seems that we have to RMA our CPUs after all:

    https://community.amd.com/message/2816858#comment-2816858
Comment 68 Conrad Meyer freebsd_committer 2017-08-09 23:26:17 UTC
Just curious -- what model Ryzens are you seeing the faults on?

I'm curious to what extent better binning affects the symptoms.  Are mostly R3s and R5s affected?  Or do R7s see the problem too?  Are 1800Xs less susceptible than 1700s and 1700Xs?  Etc.
Comment 69 Nils Beyer 2017-08-09 23:33:52 UTC
(In reply to Conrad Meyer from comment #68)

- the poudriere segfault/unable to rename/kernel panic Ryzen is a 1700 stock, manufacturing date: UA 1707PGT

- the unable to rename/never poudriere tested Ryzen is a 1700 stock, manufacturing date: <unknown>
Comment 70 Mark Millard 2017-08-14 17:27:47 UTC
Summary. . . .

Overall: using a virtual machine and restricting
its "processor" count to match one per core does not
seem to change the "compilation" problems much, if
at all.

Details. . .

I have access to a Ryzen7 1800X system again: replacement
motherboard in place.

So I've been experimenting but in a different context:

FreeBSD 11.1-STABLE -r322433 running under a VirtualBox
virtual machine that is running on Windows 10 Pro.

Also I've tried assigning both cases of:

16 "processors" to the VM (a mean of 2 per core)
and:
8 "processors" to the VM (a mean of 1 per core)

Either way, building ports gets failures where retrying
either fails at a different place or succeeds.  (I'm using
poudriere set to match PARALLEL_JOBS to the VM
"processor" count; 8 in the poudriere.conf example below.)
lang/ghc seems the most reliable/quickest to fail of the
things attempted so far (still only a few failures).

Via grep for Failed :

[02:26:39] [11] [00:08:55] Finished devel/llvm39 | llvm39-3.9.1_6: Failed: build
[03:35:53] [03] [00:07:35] Finished lang/ghc | ghc-8.0.2_1: Failed: build
[04:03:15] [04] [00:10:32] Finished math/openblas | openblas-0.2.19_1,1: Failed: build
[01:12:55] [09] [00:00:11] Finished net-im/farstream | farstream-0.2.7: Failed: fetch
[03:43:13] [05] [03:22:48] Finished java/openjdk7 | openjdk-7.141.02,1: Failed: build/runaway
[00:09:18] [01] [00:06:15] Finished lang/ghc | ghc-8.0.2_1: Failed: build
[00:14:21] [02] [00:11:18] Finished math/openblas | openblas-0.2.19_1,1: Failed: build

(I have not yet let poudriere run to completion: I've
been trying various variations of its and the VM's
settings.)

As stands I have:

# diff /usr/local/etc/poudriere.conf.sample /usr/local/etc/poudriere.conf
12a13
> ZPOOL=zrFBSDx6411SL
30c31,32
< FREEBSD_HOST=_PROTO_://_CHANGE_THIS_
---
> #FREEBSD_HOST=_PROTO_://_CHANGE_THIS_
> FREEBSD_HOST=ftp://ftp.freebsd.org
157a160
> PARALLEL_JOBS=8
196a200
> ALLOW_MAKE_JOBS_PACKAGES="pkg ccache py* gcc* llvm* ghc* *webkit* *office* chromium* iridium* mongodb*"
263c267
< #BUILD_AS_NON_ROOT=no
---
> BUILD_AS_NON_ROOT=no

(VirtualBox warns about setting more than 8
"processors".  Staying at 8 does seem to avoid the massive
"system" overhead seen in top in the FreeBSD
instance when 16 "processors" were in use.)

Task Manager's Performance tab's plots indicate
that 8 threads vastly dominate the CPU activity
at the system level when the VM is given 8
"processors".

I do have: kern.hz=100
Comment 71 Mark Millard 2017-08-18 23:22:40 UTC
Has anyone tried doing a bunch of lang/ghc builds
on Ryzen to see the failure rate, but in the type
of context indicated below (specified via a
poudriere context's techniques):

PARALLEL_JOBS=1
ALLOW_MAKE_JOBS=no

and ALLOW_MAKE_JOBS_PACKAGES not having a match
for lang/ghc ?

In other words: avoiding parallel builds during
lang/ghc's build?

If not, is someone willing to try such?

Doing this from:

# uname -apKU
FreeBSD FreeBSDx64OPC 12.0-CURRENT FreeBSD 12.0-CURRENT  r322596M  amd64 amd64 1200040 1200040

which is a VirtualBox guest under Windows 10 Pro,
has failed for me so far.

Showing 2 separate attempts in my context:

[11 of 95] Compiling Data.Binary.Class ( libraries/binary/src/Data/Binary/Class.hs, bootstrapping/Data/Binary/Class.o )
ghc/ghc.mk:111: ghc/stage1/package-data.mk: No such file or directory
gmake[2]: *** [utils/ghc-cabal/ghc.mk:48: utils/ghc-cabal/dist/build/tmp/ghc-cabal] Bus error (core dumped)
gmake[1]: *** [Makefile:130: all] Error 2
gmake[1]: Leaving directory '/wrkdirs/usr/ports/lang/ghc/work/ghc-8.0.2'
*** Error code 1

vs. (getting well past 11 of 95)

[32 of 95] Compiling Distribution.Version ( libraries/Cabal/Cabal/Distribution/Version.hs, bootstrapping/Distribution/Version.o )
ghc/ghc.mk:111: ghc/stage1/package-data.mk: No such file or directory
gmake[2]: *** [utils/ghc-cabal/ghc.mk:48: utils/ghc-cabal/dist/build/tmp/ghc-cabal] Bus error (core dumped)
gmake[1]: *** [Makefile:130: all] Error 2
gmake[1]: Leaving directory '/wrkdirs/usr/ports/lang/ghc/work/ghc-8.0.2'
*** Error code 1


FYI:

# svnlite info /usr/ports/ | grep "Re[plv]"
Relative URL: ^/head
Repository Root: svn://svn.freebsd.org/ports
Repository UUID: 35697150-7ecd-e111-bb59-0022644237b5
Revision: 448068
Last Changed Rev: 448068

This is the vintage of /usr/ports that is being used.

I cause ports to build based on:

CFLAGS:=		${CFLAGS} ${DEBUG_FLAGS}

instead of:

CFLAGS:=		${CFLAGS:N-O*:N-fno-strict*} ${DEBUG_FLAGS}

(So optimized, but with debug information, for WITH_DEBUG.)
In my context WITH_DEBUG happens to be in use.

# more /usr/local/etc/poudriere.d/make.conf
WANT_QT_VERBOSE_CONFIGURE=1
#
DEFAULT_VERSIONS+=perl5=5.24 gcc=7
#
# From a local /usr/ports/Mk/bsd.port.mk extension:
ALLOW_OPTIMIZATIONS_FOR_WITH_DEBUG=
#
.if ${.CURDIR:M*/devel/llvm*}
#WITH_DEBUG=
.elif ${.CURDIR:M*/www/webkit-qt5*}
#WITH_DEBUG=
.else
WITH_DEBUG=
.endif
WITH_DEBUG_FILES=
MALLOC_PRODUCTION=

and I use in /usr/ports/Mk/bsd.port.mk:

 STRIP_CMD=	${TRUE}
 .endif
 DEBUG_FLAGS?=	-g
+.if defined(ALLOW_OPTIMIZATIONS_FOR_WITH_DEBUG)
+CFLAGS:=		${CFLAGS} ${DEBUG_FLAGS}
+.else
 CFLAGS:=		${CFLAGS:N-O*:N-fno-strict*} ${DEBUG_FLAGS}
+.endif
 .if defined(INSTALL_TARGET)
 INSTALL_TARGET:=	${INSTALL_TARGET:S/^install-strip$/install/g}
 .endif

In my context the builds were attempted with:

/usr/bin/nohup poudriere bulk -j zrFBSDx64Cjail -w lang/ghc &

This was after a prior parallel attempt had built the
prerequisite packages (but for which lang/ghc failed).
So the non-parallel commands only tried to build
lang/ghc.
Comment 72 Mark Millard 2017-08-19 01:44:54 UTC
(In reply to Mark Millard from comment #71)

I've done a few runs recovering the ghc.*.core
files and did a quick comparison/contrast:

info reg shows a few registers that seem to be
usually the same or very similar in value:

rbp            0x8053f17c8	0x8053f17c8
rbp            0x8053f17c8	0x8053f17c8

rbp            0x8053f17f8	0x8053f17f8

rbp            0x8053f17b8	0x8053f17b8

rsp            0x7fffffff8a08	0x7fffffff8a08
rsp            0x7fffffff8a08	0x7fffffff8a08
rsp            0x7fffffff8a08	0x7fffffff8a08
rsp            0x7fffffff8a08	0x7fffffff8a08

r13            0x3861ad8	59120344
r13            0x3861ad8	59120344
r13            0x3861ad8	59120344
r13            0x3861ad8	59120344

cs             0x43	67
cs             0x43	67
cs             0x43	67
cs             0x43	67

ss             0x3b	59
ss             0x3b	59
ss             0x3b	59
ss             0x3b	59

By contrast, rip can vary somewhat,
but its values have the status
"Cannot access memory at":

rip            0x646b7c	0x646b7c
rip            0x2db9b54	0x2db9b54
rip            0x646b7c	0x646b7c
rip            0x2db9b62	0x2db9b62

So in each case it jumped to an address that
cannot be accessed.

For reference for these 4 runs:

[34 of 95] Compiling Language.Haskell.Extension ( libraries/Cabal/Cabal/Language/Haskell/Extension.hs, bootstrapping/Language/Haskell/Extension.o )
ghc/ghc.mk:111: ghc/stage1/package-data.mk: No such file or directory
gmake[2]: *** [utils/ghc-cabal/ghc.mk:48: utils/ghc-cabal/dist/build/tmp/ghc-cabal] Bus error (core dumped)
gmake[1]: *** [Makefile:130: all] Error 2
gmake[1]: Leaving directory '/wrkdirs/usr/ports/lang/ghc/work/ghc-8.0.2'
*** Error code 1

[35 of 95] Compiling Distribution.Compiler ( libraries/Cabal/Cabal/Distribution/Compiler.hs, bootstrapping/Distribution/Compiler.o )
ghc/ghc.mk:111: ghc/stage1/package-data.mk: No such file or directory
gmake[2]: *** [utils/ghc-cabal/ghc.mk:48: utils/ghc-cabal/dist/build/tmp/ghc-cabal] Bus error (core dumped)
gmake[1]: *** [Makefile:130: all] Error 2
gmake[1]: Leaving directory '/wrkdirs/usr/ports/lang/ghc/work/ghc-8.0.2'
*** Error code 1

[34 of 95] Compiling Language.Haskell.Extension ( libraries/Cabal/Cabal/Language/Haskell/Extension.hs, bootstrapping/Language/Haskell/Extension.o )
ghc/ghc.mk:111: ghc/stage1/package-data.mk: No such file or directory
gmake[2]: *** [utils/ghc-cabal/ghc.mk:48: utils/ghc-cabal/dist/build/tmp/ghc-cabal] Bus error (core dumped)
gmake[1]: *** [Makefile:130: all] Error 2
gmake[1]: Leaving directory '/wrkdirs/usr/ports/lang/ghc/work/ghc-8.0.2'
*** Error code 1

[11 of 95] Compiling Data.Binary.Class ( libraries/binary/src/Data/Binary/Class.hs, bootstrapping/Data/Binary/Class.o )
ghc/ghc.mk:111: ghc/stage1/package-data.mk: No such file or directory
gmake[2]: *** [utils/ghc-cabal/ghc.mk:48: utils/ghc-cabal/dist/build/tmp/ghc-cabal] Bus error (core dumped)
gmake[1]: *** [Makefile:130: all] Error 2
gmake[1]: Leaving directory '/wrkdirs/usr/ports/lang/ghc/work/ghc-8.0.2'
*** Error code 1
Comment 73 Don Lewis freebsd_committer 2017-08-19 06:10:27 UTC
When I've examined a ghc core file, gdb thought that rip was pointing at code and allowed me to disassemble it.   I didn't see anything that looked like it could cause SIGBUS.

I don't think I've ever had a successful ghc build on my Ryzen machine.


For a while I've been suspicious that the problems are triggered by the migration of threads between CPU cores.  One thing that made me suspect this is that most of the early tests that people did, like running games and synthetic tests like prime95, would create a fixed number of threads that probably always stayed running on the same cores.  Parallel software builds are a lot more chaotic, with lots of processes being created and destroyed, and a lot of thread migration being necessary to keep the load on all cores roughly balanced.

For the last week or so I've been running experiments where I start multiple parallel buildworlds at the same time but with different MAKEOBJDIRPREFIX values and different cpuset cpu masks.  I was looking for any evidence that migration between different threads on the same core, or between different cores in the same CCX, or between different CCXs would trigger build failures.  The interesting result is that I observed no failures at all!  One possibility is that my test script was buggy and was missing build failures.  Another is that the value that I used for "make -j" vs. the number of logical cpus in the cpuset was not resulting in much migration.  A third is that the use of cpuset was inhibiting the ability of the scheduler to migrate threads to balance the load across all cores.

I started looking at the scheduler code to see if I could understand what might be going on, but the code is pretty confusing.  I did stumble across some nice sysctl tuning knobs that looked like they might be interesting to experiment with.  The first is kern.sched.balance "Enables the long-term load balancer".  This is enabled by default and periodically moves threads from the most loaded CPU to the least loaded CPU.  I disabled this.  The next knob is kern.sched.steal_idle "Attempts to steal work from other cores before idling".  I disabled this as well. The last is kern.sched.affinity, "Number of hz ticks to keep thread affinity for".  I think if the previous two knobs are turned off, this will only come into play if a thread has been sleeping more than the specified time.  If so, it probably gets scheduled on the CPU with the least load when the thread wakes up.  The default value is 1.  I cranked it up to 1000, which should be long enough for any of its state in cache to have been fully flushed.

After using this big hammer, I started a poudriere run to build my set of ~1700 ports.  The result was interesting.  The only two failures were the typical ghc SIGBUS failure, and chromium failed to build with the rename problem.  CPU utilization wasn't great due to some cores running out of work to do, so I typically saw 5%-10% idle times during the poudriere run.

I think that the affinity knob is probably the key one here.  I'll try cranking it down to something a bit lower and re-enabling the balancing algorithms to see what happens.
Comment 74 Mark Millard 2017-08-19 07:31:50 UTC
(In reply to Don Lewis from comment #73)

I'm trying:

sysctl kern.sched.balance=0
sysctl kern.sched.steal_idle=0

For my test with:

PARALLEL_JOBS=1
ALLOW_MAKE_JOBS=no

and ALLOW_MAKE_JOBS_PACKAGES not having a match
for lang/ghc

While it is still building, so far it is vastly
past where it was failing before.

(Note: kern.sched.balance=0 by itself was not
enough to get this far.)

I'll report later if it completes. If it does
I'll test just kern.sched.steal_idle=0 by itself.
Comment 75 Mark Millard 2017-08-19 09:03:18 UTC
(In reply to Mark Millard from comment #74)

I got a lang/ghc build that completed on
the Ryzen:

# pkg search ghc
ghc-8.0.2_1                    Compiler for the functional language Haskell

This used as a context:

sysctl kern.sched.balance=0
sysctl kern.sched.steal_idle=0

So I'm starting the test of a build attempt
in a context with:

sysctl kern.sched.balance=1
sysctl kern.sched.steal_idle=0

also for:

PARALLEL_JOBS=1
ALLOW_MAKE_JOBS=no

and ALLOW_MAKE_JOBS_PACKAGES not having a match
for lang/ghc

Note that as the system has only general/basic
background activities besides the build,
kern.sched.balance=1 vs. kern.sched.balance=0
likely makes little difference in my context
for this test: there is no need to migrate to
a less busy thread.

As it stands, this new build is well past where
the combination:

sysctl kern.sched.balance=1
sysctl kern.sched.steal_idle=1

was getting. I'd not be surprised if it completes
unless something forces a HW-thread migration to
a less busy hw thread (or should I write "core"?).
Comment 76 Mark Millard 2017-08-19 10:40:21 UTC
(In reply to Mark Millard from comment #75)

sysctl kern.sched.balance=1
sysctl kern.sched.steal_idle=0

also completed the lang/ghc build.

I'm now testing:

sysctl kern.sched.balance=0
sysctl kern.sched.steal_idle=0

but with:

#PARALLEL_JOBS=1
ALLOW_MAKE_JOBS=yes

and ALLOW_MAKE_JOBS_PACKAGES having a match
for lang/ghc
Comment 77 Mark Millard 2017-08-19 11:14:08 UTC
(In reply to Mark Millard from comment #76)

sysctl kern.sched.balance=0
sysctl kern.sched.steal_idle=0

with:

#PARALLEL_JOBS=1
ALLOW_MAKE_JOBS=yes

and ALLOW_MAKE_JOBS_PACKAGES having a match
for lang/ghc

also makes it to completion for building
lang/ghc . (Much of the time 8 hw-threads
being busy.)

[This build took about an hour less than
avoiding parallel builds did.]
Comment 78 Don Lewis freebsd_committer 2017-08-19 17:18:37 UTC
I had x11/linux-c6-xorg-libs fail in patch depends when make got a SIGBUS with
sysctl kern.sched.affinity=100
sysctl kern.sched.balance=1
sysctl kern.sched.steal_idle=1

Also, ghc failed with SIGBUS during the configure phase, so ALLOW_MAKE_JOBS is not a factor on my machine.  That said, I don't think I've ever seen multiple ghc processes running at the same time.
Comment 79 Mark Millard 2017-08-19 22:54:24 UTC
(In reply to Mark Millard from comment #77)

I have started a poudriere bulk -a based on:

sysctl kern.sched.balance=1
sysctl kern.sched.steal_idle=0

with:

#PARALLEL_JOBS=1
ALLOW_MAKE_JOBS=yes

and ALLOW_MAKE_JOBS_PACKAGES having a match
for lang/ghc (and various others, although
this should not matter given ALLOW_MAKE_JOBS)

So far it is working (relative to avoiding SIGSEGV
and SIGBUS and the like) and it has managed to
build lang/ghc .

It likely will be days before it completes a pass.
I'll add notes about what it does for SIGSEGV,
SIGBUS, and the like.

So far it appears that kern.sched.steal_idle is the
primary setting that matters. If kern.sched.balance
matters at all, it seems to do so less frequently.
Comment 80 Don Lewis freebsd_committer 2017-08-20 07:19:29 UTC
The events controlled by kern.sched.balance are timer based and are probably much less frequent than occurrences of a CPU running out of work to do and triggering a steal_idle event, unless the machine has a very high load average so that all of the CPUs have long queues of runnable processes.  Since I don't really care about scheduling fairness in these experiments, I'm planning on leaving kern.sched.balance=0.  I think that kern.sched.steal_idle is much more interesting.  In order to make the experiments as meaningful as possible, I'd like to keep the overall CPU idle time as low as possible.  If there is a lot of CPU idle time with kern.sched.steal_idle=0, then the reduction of memory bandwidth demand and lower die temperature can confuse the results.

Based on my previous experiment, I suspect that steal_idle events ignore the CPU affinity time setting, which sort of makes sense.

In my latest experiment, I set kern.sched.affinity=10 and set both steal_idle and balance back to 0.  This time ghc successfully built, which I attribute to a lucky roll of the dice.  During the lang/ghc build, ghc (either the bootstrap or newly built version) gets executed a number of times.  How many times it succeeds before eventually getting a SIGBUS seems random.  What is somewhat surprising is that once built, ghc was able to successfully build a large number of hs_* ports without any failures.  This time chromium also avoided the rename issue and built successfully.  The only failure was lang/go, which has also failed frequently on my Ryzen machine.  I think the lang/ghc and lang/go issues are distinct from the other random build problems.

What would be interesting would be to dig into sched_ule and tweak steal_idle to restrict the CPUs that it can steal tasks from.  One experiment would be to only allow it to steal from the other thread on the same CPU core.  Another would be to only allow it to steal from a core in the same CCX.
Comment 81 Mark Millard 2017-08-20 08:55:08 UTC
(In reply to Don Lewis from comment #80)

FYI for my tests:

So far with kern.sched.steal_idle=0 I've never had
lang/ghc fail to build.

So far with kern.sched.steal_idle=1 I've always had
lang/ghc fail to build --before it even started using
ghc-stage1. (I monitored via top.) These failures do
seem random as to the point in the lang/ghc build
at which the failure happens. (Builds that work take
vastly longer to complete as well.)

I've done a half dozen build attempts or so each way
over those two alternatives (so far).

lang/ghc builds failed for kern.sched.steal_idle=1
every time that I've tried it, even when poudriere
was configured to avoid parallel builds (including
avoiding ALLOW_MAKE_JOBS) and lang/ghc was the
only thing being built, prerequisites already in
place.

System-wide memory bandwidth usage being high does
not seem to be important to having the lang/ghc builds
fail.

Most of the failure examples are for CPU temperatures
under 40 DegC. Only for hot days in contexts without
air conditioning have I seen as high as 44 DegC for
the Ryzen PC. (Liquid cooled, two 360 mm radiators,
ten 120 mm fans, low end graphics card, and so on:
all biased to keeping temperatures down.)

Die temperature does not seem to be important to having
the lang/ghc builds fail.

(It is too preliminary to depend on yet but so far the
kern.sched.steal_idle=0 "poudriere bulk -a" has had no
SIGSEGV or SIGBUS like failures for any ports, unlike
when I'd tried with kern.sched.steal_idle=1 .)

I do not expect the kern.sched.steal_idle=0 results to be
luck at this point (up to any kern.sched.balance=1
contribution if involved).
Comment 82 Mark Millard 2017-08-20 18:49:47 UTC
(In reply to Mark Millard from comment #79)

My kern.sched.steal_idle=0 based "poudriere bulk -a"
test has now had one SIGSEGV based build failure:

===>  Building for linux-skype_oss_wrapper-0.1.1
/compat/linux/usr/bin/gcc -O2 -pipe  -fstack-protector -fno-strict-aliasing -m32 -fPIC -c libpulse.c -o libpulse.o
*** Signal 11

But it will be days before I'll be able to test if it
is repeatable or not.

I do not know if anyone else has had a successful
build of audio/linux-skype_oss_wrapper-0.1.1 's
libpulse.o on a Ryzen or not.


As for how far along it has gotten:
5399 builds of 21813 queued.


But at one point the builds suspended for:

[1] - Stopped (tty output)    /usr/bin/nohup poudriere bulk -j zrFBSDx64SLjail -w -a

and waited until I happened to look and
notice. So it is not as far along as it
would have been otherwise.
Comment 83 Don Lewis freebsd_committer 2017-08-20 21:11:33 UTC
I reran poudriere again with the same settings (steal_idle=0, balance=0, affinity=10) and got no failures this time.  Both ghc and go builds were successful.

The ghc build failed with SIGBUS on my first run with steal_idle=0 and balance=0.

I just tried building audio/linux-skype_oss_wrapper and it succeeded.
Comment 84 Mark Millard 2017-08-20 21:35:24 UTC
(In reply to Don Lewis from comment #83)

Good to know for:

audio/linux-skype_oss_wrapper
lang/ghc
lang/go

Thanks. (It does not look like the lang/go
build has started yet in my "poudriere bulk -a".)

I wonder if there are other examples of work
being migrated across hw-threads beyond what:

sysctl kern.sched.balance
sysctl kern.sched.steal_idle
sysctl kern.sched.affinity

can disable. (My ongoing test is in a context
that has only set kern.sched.steal_idle=0 ,
leaving the others at defaults.) 


I may have just had my first example of the
missing file problem:

=======================<phase: package        >============================
===>  Building package for sogo2-activesync-2.3.22
pkg-static: Unable to access file /wrkdirs/usr/ports/www/sogo2-activesync/work/stage/usr/local/GNUstep/Local/Library/SOGo/WebServerResources/ckeditor/plugins/clipboard/dialogs/paste.js:No such file or directory
*** Error code 1

I do not know if this sometimes works for

www/sogo2-activesync

vs. if the failure is repeatable. It will be
days before I can check for myself --after the
"poudriere bulk -a" does what it can.
Comment 85 Mark Millard 2017-08-20 21:41:19 UTC
(In reply to Mark Millard from comment #84)

Looks like I may have issues with Linux-related
builds and their /compat/linux/usr/bin/gcc use.
I've now gotten a 2nd example:

=======================<phase: build          >============================
===>  Building for linux_libusb-11.0r261448_2
Segmentation fault (core dumped)
Segmentation fault (core dumped)
echo libusb.so.3: /usr/lib/libpthread.a >> .depend
Warning: Object directory not changed from original /wrkdirs/usr/ports/devel/linux_libusb/work/linux_libusb-11.0r261448
/compat/linux/usr/bin/gcc  -O2 -pipe  -I/wrkdirs/usr/ports/devel/linux_libusb/work/sys -fstack-protector -fno-strict-aliasing -DCOMPAT_32BIT -DLIBUSB_GLOBAL_INCLUDE_FILE=\"libusb_global_linux.h\" -DUSB_GLOBAL_INCLUDE_FILE=\"libusb_global_linux.h\" -I ../../sys   -MD  -MF.depend.libusb20.o -MTlibusb20.o -std=gnu99 -Wsystem-headers -Werror -Wall -Wno-format-y2k -Wno-uninitialized -Wno-pointer-sign     -c libusb20.c -o libusb20.o
*** Signal 11

Stop.
make[1]: stopped in /wrkdirs/usr/ports/devel/linux_libusb/work/linux_libusb-11.0r261448
*** Error code 1
Comment 86 Mark Millard 2017-08-21 18:50:50 UTC
AMD wants to RMA the CPU for the Ryzen7 1800X PC that
I (sometimes) have access to.

So once the logistics are worked out and arrangements
made I will be out of the Ryzen testing process again.

Once this starts it could be a month or so before I'd
again have access.


FYI: with 13813 built so far, the only
SIGSEGV/SIGBUS type errors under
kern.sched.steal_idle=0 have been the 2
/compat/linux/usr/bin/gcc related SIGSEGVs.
There have been 37 failures overall. Some
are tied to not finding files so I've no
clue if those are tied to the odd problems
with such or not. But most are tied to
source code issues. A few failed to fetch.

The FreeBSD context for these tests is:

# uname -apKU
FreeBSD FBSDx6411SL 11.1-STABLE FreeBSD 11.1-STABLE  r322596  amd64 amd64 1101501 1101501

This is as a guest in a VirtualBox virtual machine
under Windows 10 Pro. 8 "processors" assigned in
VirtualBox. Around 50 GiBytes of RAM assigned.
Comment 87 Don Lewis freebsd_committer 2017-08-21 20:25:07 UTC
I set affinity back to its default value of 1 and got another clean 1700 port poudriere run.  It's curious that the only issues I've had when steal_idle=0 and balance=0 happened when I set affinity=1000.  This is the opposite of what I would expect.

I would expect migrations controlled by the steal_idle and balance knobs to have similar issues.  In either case, the thread that is getting migrated is one that was preempted by an interrupt, and before being resumed, the scheduler noticed that the thread had exhausted its run time quantum and moved the thread to the back of the run queue for that cpu before resuming the thread that is at the front of the run queue.  The only difference between steal_idle and balance is the event that actually causes the thread to migrate.  When they restart, they basically just execute the kernel code to restore their state before dropping back into user mode where they were preempted from.  For some reason, threads that have exhausted their time quantum seem to resume properly on the same CPU that they were previously running on, but sometimes go wonky if they resume on some other CPU.

The migrations controlled by the affinity knob are different.  In those cases, the thread has voluntarily put itself to sleep, either because it blocked in a syscall, or perhaps trapped on a page fault and then went to sleep in the kernel while the missing page was brought in.  When these threads get a wakeup event, they then execute the remaining part of the syscall or the page fault handler before returning to user mode.  It doesn't seem to matter what CPU these threads restart on.

As a test, I set balance=1 and reduced balance_interval from its default 127 to 10 so that balance events would happen a lot more frequently to try to make up for the steal_idle being disabled.  I had three port build failures.  The first was a guile segfault when building finance/gnucash.  The second was a unit test failure in editors/openoffice-devel.  The third was build runaway in devel/doxygen.

The steal_idle code in sched_ule is topology-aware, so it looks like it should be easy to hack the code to only allow migrations between SMT threads sharing the same core, or cores in the same CCX.
Comment 88 Don Lewis freebsd_committer 2017-08-21 20:37:20 UTC
(In reply to Mark Millard from comment #85)
With steal_idle=0 and balance=0 I was able to sucessfully build both
www/sogo2-activesync and devel/linux_libusb.


There is definitely some downside to disabling steal_idle.  During my next to last large poudriere run, I was seeing CPU idle% in the teens even while the load average was in the mid-40's.
Comment 89 Don Lewis freebsd_committer 2017-08-22 05:56:49 UTC
In my latest experiment, I hacked the sched_ule steal_idle code to only allow threads to be stolen from the other SMT thread on the same core and set steal_idle=1.  CPU idle time was greatly reduced, but ghc failed with SIGBUS and chromium failed with the rename problem.

I don't necessarily count this as a failure since the first time I tried steal_idle=0, balance=0, and affinity=1000 I got the same two failures.  The rename problems just seem to be really rare.  The ghc failures definitely seemed to improve with steal_idle=0, but this experiment only provides one data point.
Comment 90 Don Lewis freebsd_committer 2017-08-22 15:50:22 UTC
When stealing a thread from the other SMT thread on the same core, another tuning knob comes into play, kern.sched.steal_thresh.  A thread will only be stolen if the load on that other SMT thread exceeds that value, which defaults to 2.  My previous experiment used the default, but for my latest experiment, I set it to 1, to match the hardwired value that is used for stealing from other levels of the hierarchy.

The results were definitely poor.  In addition to the usual ghc SIGBUS, lang/guile2 failed to build due to a SIGSEGV, textproc/p5-String-ShellQuote failed in the fetch phase when make got a SIGBUS, and editors/libreoffice failed to build  because of a clang SIGABRT.
Comment 91 Mark Millard 2017-08-22 17:52:36 UTC
My first ever completed attempt at anything
like:

poudriere bulk -w -a

finished. (It started from a prior interrupted
attempt from a prior boot, so some things had
already built and the prior failures were
retried.) It reported "62 FAIL" overall, most
for source code rejections based on a preliminary
look. Few Signals.

But when I went to look at dmesg -a there
were lots of Signals there:

FBSDx6411SL# dmesg -a | grep "^pid" | cut -d " " -f 3-100 | sort | uniq -c | less
   1 (a.out), uid 0: exited on signal 11 (core dumped)
   1 (bash), uid 0: exited on signal 11 (core dumped)
   3 (cc), uid 0: exited on signal 11 (core dumped)
   1 (clang), uid 0: exited on signal 11 (core dumped)
   1 (cm3), uid 0: exited on signal 6 (core dumped)
   1 (conftest) is attempting to use unsafe AIO requests - not logging anymore
   1 (conftest), uid 0: exited on signal 10 (core dumped)
   4 (conftest), uid 0: exited on signal 11 (core dumped)
   4 (conftest), uid 0: exited on signal 6 (core dumped)
  76 (fc-cache), uid 0: exited on signal 11 (core dumped)
   4 (gcc), uid 0: exited on signal 11 (core dumped)
  17 (gdk-pixbuf-query-lo), uid 0: exited on signal 11 (core dumped)
 121 (gio-querymodules-64), uid 0: exited on signal 11 (core dumped)
   1 (go), uid 0: exited on signal 5 (core dumped)
  16 (gtk-query-immodules), uid 0: exited on signal 11 (core dumped)
   1 (nm), uid 0: exited on signal 6 (core dumped)
  17 (pango-querymodules-), uid 0: exited on signal 11 (core dumped)
  24 (pctest), uid 0: exited on signal 11 (core dumped)
  24 (pdtest), uid 0: exited on signal 11 (core dumped)
  24 (pstest), uid 0: exited on signal 11 (core dumped)
  24 (pztest), uid 0: exited on signal 11 (core dumped)
   2 (readonly.exe), uid 0: exited on signal 11
   1 (scm), uid 0: exited on signal 11 (core dumped)
   1 (test_26349), uid 0: exited on signal 4 (core dumped)
  15 (try), uid 0: exited on signal 11

How much of the Signal activity is abnormal I do not know.

These are all from the kern.sched.steal_idle=0
type of test context.

Some other odd items in FreeBSD's dmesg -a output were:

interrupt storm detected on "irq19:"; throttling interrupt source

Failed to fully fault in a core file segment at VA 0x800641000 with size 0x1000 to be written at offset 0x25000 for process conftest

Failed to fully fault in a core file segment at VA 0x800643000 with size 0x1000 to be written at offset 0x25000 for process a.out



FYI: systat -vmstat indicates "le0" for irq19.
Comment 92 Don Lewis freebsd_committer 2017-08-23 01:19:42 UTC
I'm building the same set of ports as I do on my FX-8320E machine so I have a reasonable idea of what to expect in terms of package build fallout.  Some amount of core dump messages getting logged is fairly normal.

I've never seen the "Failed to fully fault in a core file segment" message.

The motherboard that I'm using has an igb interface.  The interrupt storm messages are likely to be specific to that chip, driver, or motherboard.

I've set kern.sched.balance=0 for my testing since I suspect it could have similar issues as kern.sched.steal_idle and I want to eliminate that source of noise.

On Ryzen, at the CPU topology level that only includes the SMT threads belonging to one core, the steal_idle code will only steal a thread from the other SMT thread if the load on that other SMT thread exceeds steal_thresh (default 2).  At the other CPU hierarchy levels, the threshold for stealing a thread is hardwired to 1.  That could sometimes allow a thread to be stolen from the other SMT thread on the same core even though that was not allowed on the previous iteration.  Since my last experiment (with steal_thresh=1) exhibited a lot of random failures, I hacked the code to set the threshold at the other hierarchy levels to 2.  I also hacked the code to only steal threads from cores in the same CCX.  The results of this experiment only had two build failures.  One was the usual ghc SIGBUS, and the other appears to have been a SIGSEGV in lang/go14.  The latter was a bit different than the usual go build failures.  All of the ones that I have previously looked at appear to have been caused by corruption of the internal malloc state.

fatal error: unexpected signal during runtime execution
[signal 0xb code=0x1 addr=0x0 pc=0x49890c]

runtime stack:
runtime.gothrow(0x6fd3f0, 0x2a)
        /usr/local/go14/src/runtime/panic.go:503 +0x8e
runtime.sigpanic()
        /usr/local/go14/src/runtime/sigpanic_unix.go:14 +0x5e
futexsleep()
        /usr/local/go14/src/runtime/os_freebsd.c:72 +0x6c
runtime.onM(0xc208349f50)
        /usr/local/go14/src/runtime/asm_amd64.s:273 +0x9a
runtime.futexsleep(0xc208116ed8, 0xc200000000, 0xffffffffffffffff)
        /usr/local/go14/src/runtime/os_freebsd.c:58 +0x73
runtime.notesleep(0xc208116ed8)
        /usr/local/go14/src/runtime/lock_futex.go:145 +0xae
stopm()
        /usr/local/go14/src/runtime/proc.c:1178 +0x119
exitsyscall0(0xc2082197a0)
        /usr/local/go14/src/runtime/proc.c:2020 +0xd8
runtime.mcall(0x49b4c4)
        /usr/local/go14/src/runtime/asm_amd64.s:186 +0x5a
Comment 93 Mark Millard 2017-08-23 03:15:24 UTC
(In reply to Don Lewis from comment #92)

FYI for the:

> interrupt storm detected on "irq19:"; throttling interrupt source

and:

> FYI: systat -vmstat indicates "le0" for irq19.

The "le0" is from the selection I happened to make
in Virtualbox's Network "Adapter 1" 's "Adapter Type"
selection.

The "Attached to" was "Bridged Adapter" and
"Name" was "Intel(R) I211 Gigabit Network Connection".
Comment 94 Don Lewis freebsd_committer 2017-08-23 06:40:07 UTC
In my latest experiment, I still restricted steal_idle to the CCX of the current SMT thread.  At the CCX level, I changed thresh back to 1, but I also excluded the current core so that stealing from the other SMT thread can only happen if the steal_thresh condition was met for it on the first pass.  I still saw a fair amount of build fallout.  In addition to the usual ghc SIGBUS, I also saw a couple of clang SIGABRT failures when building security/nss and www/webkit2-gtk3.  Also lang/go failed with these errors:

fatal error: runtime\xc2\xb7lock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
fatal error: runtime\xc2\xb7unlock: lock count
[snip]
fatal: morestack on g0
SIGTRAP: trace trap
PC=0x44ef02 m=1 sigcode=1
Comment 95 Mark Millard 2017-08-23 09:44:31 UTC
I've now tried Hyper-V instead of VirtualBox
for building lang/ghc (the most systematic
usually-quick-failure test that I've found).

In this context:

sysctl kern.sched.balance=1
sysctl kern.sched.steal_idle=0

did not seem to help. Nor did:

sysctl kern.sched.balance=0
sysctl kern.sched.steal_idle=0

Hyper-V did not report going beyond
8 virtual processors as a problem;
VirtualBox did.

So for Hyper-V I assigned 14 Virtual
Processors, leaving by count enough for
one core (2 threads) for Windows 10 Pro.
While for VirtualBox I used 8 per its
report when a larger figure is set.

Note: I had to avoid using a M.2 NVME
SSD and instead used a SATA SSD for Hyper-V.
For NVME the attempt to create the virtual
machine crashes Windows 10 Pro. It created
the virtual machine just fine when told to
use the SATA SSD.

Note: When Hyper-V is enabled, VirtualBox
64-bit is blocked --even if every Hyper-V
virtual machine is off. I had to disable
Hyper-V to get VirtualBox working again.
[There were more issues than I've noted here
as well.]
Comment 96 Don Lewis freebsd_committer 2017-08-23 14:43:06 UTC
I undid my hack to restrict steal_idle to the current CCX, but left the thresh=2 override in place.  I saw the usual lang/ghc SIGBUS, and lang/go died with its common malloc state corruption issue.  java/openjdk7 failed with the compiler temp file rename issue.  Disappointingly, clang died with a SIGSEGV when building www/webkit-gtk3.
Comment 97 Don Lewis freebsd_committer 2017-08-24 01:16:37 UTC
Since this behavior seems to be sensitive to the scheduler, I tried SCHED_4BSD.  Things started out promising, though ghc died with its usual SIGBUS.  The chromium build failed due to the compiler temp file rename issue.  Disappointingly the firefox build failed due to a clang SIGABRT.
Comment 98 Mark Millard 2017-08-29 22:41:05 UTC
I tried an experiment of an amd64 -> armv6 cross build
of lang/gcc7 via poudriere and it appears that I got
a SIGSEGV in:

/wrkdirs/usr/ports/lang/gcc7/work/.build/./gcc/xgcc

that may be one of the Ryzen ones.

(It will be a while before a retry might stop at
the same place if it does. I'll report if the
stopping point turns out to be repeatable.)

The context here is FreeBSD 11.1-STABLE -r322591
running as a Hyper-V guest on a machine booted
via Windows 10 Pro. Default work stealing for
hardware threads and load redistribution across
threads are still in place in the kernel
settings: no adjustments.
Comment 99 Mark Millard 2017-08-29 23:03:06 UTC
(In reply to Mark Millard from comment #98)

The retry is way past the earlier SIGSEGV point
(but is still building).

So even under qemu-arm-static emulation of
arm instructions from an arm xgcc the Ryzen
failures appear to occur.
Comment 100 Don Lewis freebsd_committer 2017-08-30 23:41:20 UTC
(In reply to Mark Millard from comment #95)
I wonder if the virtual CPUs in Hyper-V are not pinned to physical CPUs, but are allowed to float around.  That could cause threads in the FreeBSD guest to migrate between CCX's without FreeBSD knowing about it.
Comment 101 Nils Beyer 2017-09-04 08:00:17 UTC
Back from my vacation and got the RMA CPU dated "UA 1730SUS" (end of July) - will try a poudriere run now...
Comment 102 Don Lewis freebsd_committer 2017-10-07 19:51:54 UTC
I finally got around to contacting AMD a couple weeks ago about a warranty replacement for my CPU.  My original CPU had a date code of 1708SUT and the new one has a date code of 1733SUS.  There is no evidence that the replacement was manually tested.

My first poudriere run after installing the new CPU, with all else unchanged (no clock speed changes or the BIOS upgrade requested by AMD) was much improved, but with both of the typical lang/ghc and lang/go build failures.

Next, I wanted to test the LDT fixes that were committed to 12.0-CURRENT a couple days ago, so I upgraded kernel and world from r323398 to r324367.  I also changed the RAM clock from the 1866 MHz override that I had been using to the BIOS auto setting of 2400 MHz.  The following poudriere run succeeded in building lang/ghc (which I've never seen happen when building my full set of ports), but lang/go still failed to build, and one of the py-* ports had a build runaway.

Next I upgraded the BIOS to AGESA 1.0.0.6b.  This time, lang/go failed to build, guile segfaulted when building finance/gnucash, and devel/doxygen experienced a build runaway.

Other than lang/go, I don't think any of these problems are Ryzen-specific.  I have seen this gnucash build problem on my FX-8320E, but I think only when building using a 12.0-CURRENT build jail.  It happens very frequently, but not 100% of the time.  I have also seen build runaway failures on my FX-8320E but they seem to be pretty rare.  They definitely do seem to be more common on my Ryzen machine.
Comment 103 Don Lewis freebsd_committer 2017-10-09 07:17:06 UTC
For my next experiment, I built my usual set of packages using the same set of jails that I use on my other build box:
  10.4 i386
    lang/go build failure (malloc corruption)

  11.1 i386
    lang/go build failure (malloc corruption)
    german/gimp-help multiple python SEGVs (memory corruption?)
    
  11.1 amd64
    lang/go build failure (fatal error: attempt to execute C code on Go stack)
    lang/guile2 build failure (guile SEGV)
    math/scilab build failure (SEGV in Java Runtime Environment called from scilab when building docs)

  12.0 amd64
    lang/go build failure (crash during garbage collection)
    www/chromium configure failure (SIGTRAP in gn due to assertion failure)

The go build failure seems to be specific to Ryzen and my replacement CPU did not fix it.

The ghc build failure seems to be gone after upgrading to a more recent 12.0-CURRENT.  I will try to bisect for the fix when I have a chance.

I've seen the guile failures on both my Ryzen machine and my AMD FX-8320E package build machine.  The failures are somewhat sporadic.

I've never seen the scilab or chromium failures before.

The python SEGV failures look like a corrupted copy of the executable cached in RAM (or in tmpfs).  I do have ECC RAM and haven't seen any MCA errors of any kind since replacing my CPU.
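One way to probe the corrupted-cached-copy hypothesis is to hash the same file repeatedly and check whether the result changes between reads; a minimal, hedged sketch (the target file is just an example, not one named in this report):

```shell
#!/bin/sh
# Hedged sketch: hash a binary twice and compare. On a healthy machine
# the two checksums always match; a mismatch would point at corrupted
# cached pages rather than a bad file on disk. cksum is POSIX.
f=${1:-/bin/sh}
c1=$(cksum < "$f")
c2=$(cksum < "$f")
if [ "$c1" = "$c2" ]; then
    echo "checksums match"
else
    echo "MISMATCH: possible memory/cache corruption"
fi
```

On FreeBSD, `pkg check -s` compares installed files against their recorded checksums, which would help distinguish on-disk damage from in-memory corruption, and MCA events (when they occur) show up in the dmesg output.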

I also plan on bisecting userland to try to track down the sporadic build runaway errors that I've recently seen, both in earlier poudriere runs on my replacement CPU as well as on my FX-8320E.  I suspect a userland problem since I've only seen this when building using a 12.0-CURRENT jail and the problem appears to be recent.
Comment 104 Mark Millard 2017-12-02 06:36:04 UTC
(In reply to Don Lewis from comment #103)

I'm experimenting with a Ryzen Threadripper
1950X on head -r326192 and /usr/ports/
-r454407; its first go build got:

. . .
bytes
strings
hash/adler32
fatal error: attempt to execute C code on Go stack
bufio

runtime stack:
runtime.throw(0x67f725)
        /usr/local/go14/src/runtime/panic.go:491 +0xad
badc()
        /usr/local/go14/src/runtime/stack.c:891 +0x2a
runtime.onM(0x68b110)
        /usr/local/go14/src/runtime/asm_amd64.s:257 +0x68
runtime.mstart()
        /usr/local/go14/src/runtime/proc.c:818

So about the same as you got on amd64 with 11.1 .
The 2nd attempt built go fine.

(The ghc build went fine but I guess that is now
expected, though I do not know just what changed.)


Most failing builds fail reliably
when retried.

At this point I've been focused on builds
that fail initially but succeed on a later
retry. So far none of these trace back to
SIGSEGV or SIGBUS in the failures.

FYI for the initial from-scratch bulk -a
(i.e., with -C after smaller experiments):

QUEUE BUILT FAIL SKIP IGNORE REMAIN TIME     
27777 25049  364 2015    349      0 54:09:28

Context:
Running in a Windows 10 Pro Hyper-V
Virtual Machine, 29 logical processors
assigned (of 2*16 in hardware), 110592
MBytes RAM assigned. Samsung 960 Pro
1 TByte nvme-ssd for the file system
(not a .vhd* file under NTFS), 262144
MByte swap partition on a separate
device (again, not inside NTFS).

UFS, not ZFS
USE_TMPFS=yes
no PARALLEL_JOBS assignment, so, 29 builders
ALLOW_MAKE_JOBS=yes (so possibly 29*29 active processes)
no CCACHE use

I build with both optimizations and debug
information. . .

/usr/ports/Mk/bsd.port.mk has:

 STRIP_CMD=	${TRUE}
 .endif
 DEBUG_FLAGS?=	-g
+.if defined(ALLOW_OPTIMIZATIONS_FOR_WITH_DEBUG)
+CFLAGS:=		${CFLAGS} ${DEBUG_FLAGS}
+.else
 CFLAGS:=		${CFLAGS:N-O*:N-fno-strict*} ${DEBUG_FLAGS}
+.endif
 .if defined(INSTALL_TARGET)
 INSTALL_TARGET:=	${INSTALL_TARGET:S/^install-strip$/install/g}
 .endif

and /usr/local/etc/poudriere.d/make.conf has:
                                                                                                                                                  
WANT_QT_VERBOSE_CONFIGURE=1
#
DEFAULT_VERSIONS+=perl5=5.24 gcc=7
#
# From a local /usr/ports/Mk/bsd.port.mk extension:
ALLOW_OPTIMIZATIONS_FOR_WITH_DEBUG=
#
.if ${.CURDIR:M*/devel/llvm*}
#WITH_DEBUG=
.elif ${.CURDIR:M*/lang/cling*}
#WITH_DEBUG=
.elif ${.CURDIR:M*/www/*webkit*}
#WITH_DEBUG=
.else
WITH_DEBUG=
.endif
MALLOC_PRODUCTION=

(Note: lang/cling above was added after
observing the -C -a build time out while
packaging its 16 GB+ build. Similarly
www/qt5-webkit was generalized to
www/*webkit* . Under WITH_DEBUG= these
are massively huge and fail.)

[My only attempt to boot -r326192 directly
did boot, but soon after being in use the
Ethernet, the USB keyboard, and the USB
mouse were hung up. I've not pursued this
any further but may someday.]

The board/BIOS supports ECC mode but the
RAM in use is not ECC RAM at this point.

lang/guile2 built fine the first time.
So did math/scilab and www/chromium .
Comment 105 Mark Millard 2017-12-02 07:48:49 UTC
(In reply to Mark Millard from comment #104)

FYI: So far what failed initially but
built in a later retry of a bulk -a
(no -C ) is:

tome4-1.5.5
sv-gimp-help-html-2.8.2
zh_CN-gimp-help-html-2.8.2
reflex-20170521
tfel-mfront-2.0.4_1
p5-B-C-1.55
mpqc-mpich-2.3.1_30
samba45-4.5.14
iroffer-dinoex-3.30_2
nifticlib-2.0.0_1
apache-openoffice-devel-4.2.1810071_1,4
en_GB-gimp-help-html-2.8.2
de-gimp-help-html-2.8.2
ca-gimp-help-html-2.8.2
go-1.9.2,1

This is after a couple of retry attempts
that could add to the list.
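Failures that clear on a simple retry like these can be worked around automatically; a minimal sketch (the `retry` helper is hypothetical, and the poudriere invocation in the comment is illustrative, not taken from this report):

```shell
#!/bin/sh
# Hedged sketch: retry a flaky command a few times before giving up.
retry() {
    max=$1; shift
    n=1
    until "$@"; do
        [ "$n" -ge "$max" ] && return 1
        n=$((n + 1))
    done
    return 0
}

# Real use might look like: retry 3 poudriere bulk -f pkglist
# Portable demonstration:
retry 3 true && echo "build succeeded"
```

This only papers over the nondeterministic failures, of course; builds that fail reliably on retry still need investigation.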

There are some examples where looking
at the logs and the detailed build order
leads me to expect that

ALLOW_MAKE_JOBS=yes

might not be supported correctly.
An example is:

tfel-mfront-2.0.4_1 :

/usr/bin/ld: cannot find -lMFrontLogStream
c++: error: linker command failed with exit code 1 (use -v to see invocation)
*** [mfront] Error code 1

which may have just happened before libMFrontLogStream
was available.
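A race like this typically arises when a parallel make runs a link step before the library target it consumes has been produced. A hypothetical Makefile fragment illustrating the failure mode (these rules are a sketch, not the actual tfel-mfront build files):

```make
# Under "make -j N", mfront and libMFrontLogStream.so may build
# concurrently; because mfront does not declare the library as a
# prerequisite, the link can run before the library exists and
# fail with "cannot find -lMFrontLogStream".
all: libMFrontLogStream.so mfront

libMFrontLogStream.so: logstream.o
	cc -shared -o $@ logstream.o

mfront: mfront.o        # BUG: should also depend on libMFrontLogStream.so
	cc -o $@ mfront.o -L. -lMFrontLogStream
```

That would explain why the failure appears only with ALLOW_MAKE_JOBS=yes and vanishes on retry, once the library already exists on disk.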

There were a couple of cases of the initial
failures being:

Killing runaway build after 7200 seconds with no output

but the retry did not get that. (Different
competing processes were involved.)

None of that list were SIGSEGV or SIGBUS
failures according to the log files.

The sequence is:

QUEUE BUILT FAIL SKIP IGNORE REMAIN
27777 25049  364 2015    349      0
 2730   185  354 1840    351      0
 2547   266  348 1580    353      0

Things that moved from SKIP to BUILT
generally depended on something for which
I turned off WITH_DEBUG= as I went along
(such as www/*webkit* instead of
www/qt5-webkit* ).

It does appear that the 1950X threadripper
avoids the issues that I saw with the
1800X.