Bug 219399

Summary: System panics after several hours of 14-threads-compilation orgies using poudriere on AMD Ryzen...
Product: Base System Reporter: Nils Beyer <nbe>
Component: kern Assignee: freebsd-bugs (Nobody) <bugs>
Status: Open ---    
Severity: Affects Some People CC: alinamahi93, armaangujjar69, bayangsenja44, chris, dragongamescs, emaste, georgiadeeds1, isongsoundtrack, kabukimaskcs, karthik.gana12, kib, marklmi26-fbsd, mubashraameen74, nasreenbibi721, romanmiller720, saamhocks8, shitman71, siyal7230, tablosazi.farahan, tomasjerrii, truckman, zao
Priority: --- Keywords: patch
Version: 11.1-RELEASE   
Hardware: amd64   
OS: Any   
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221029
Bug Depends on:    
Bug Blocks: 221029    
Attachments:
Description | Flags
core.txt.1 | none
screenshot of crash with backtrace | none
Ryzen stress test using buildworld/buildkernel as endless loop | none
Ryzen stress test script - now working and working as user; needed to set MAKEOBJDIRPREFIX in environment not as parameter... | none
tarred archive of the segfault stress test | none
quick&dirty LUA script to read CPU temperatures; needs "superiotool" | none
patch to move amd64 shared page to a lower address to avoid Ryzen problem with executing code near user address upper limit | none
logs of failed poudriere builds | none
Don's Ryzen patch - stripped to the SHAREDPAGE define only... | none
program to cause Ryzen hang/reboot on tweaked FreeBSD by executing code in high memory | none
program to print the address of the signal trampoline | none

Description Nils Beyer 2017-05-19 09:13:18 UTC
Hi,

hardware:
---------
    MBO: MSI X370 GAMING PRO CARBON (MS-7A32)
    CPU: AMD Ryzen 1700 (default clock)
    RAM: 2x Crucial 16GB (CT16G4DFD8213.C16FAD)
    GPU: Nvidia GT218

software: FreeBSD 11.0-STABLE #1 r318443M

after several hours of poudriere compilation with 14 threads my system panics and reboots:

> Panic String: vm_fault: fault on nofault entry, addr: fffffe0092de3000

it panics every time - the interval is different every time, but always less than 24h.

I ran "memtest86+" that passed and ran "Prime95" torture from the "Ultimate Boot CD" for more than 24h without errors or crashes. So I don't think that's a hardware problem.

Unfortunately, I cannot provide a backtrace for some reason:

> root@asbach:/var/crash/#kgdb /usr/lib/debug/boot/kernel/kernel.debug /var/crash/vmcore.1
> GNU gdb 6.1.1 [FreeBSD]
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "amd64-marcel-freebsd"...
> Cannot access memory at address 0x0
> (kgdb) bt
> #0  0x0000000000000000 in ?? ()
> Cannot access memory at address 0x0
> (kgdb) quit

"vmcore.1" is 5GB large - if someone needs it I can upload it somewhere.

Any ideas what I can do to get a stable system?



Thanks and regards,
Nils
Comment 1 Nils Beyer 2017-05-19 09:31:00 UTC
Created attachment 182736 [details]
core.txt.1
Comment 3 Nils Beyer 2017-05-29 11:50:34 UTC
Unfortunately, it seems it's not relevant; I've inserted the code from the first link:
--------------------------------------------------------------------------------
root@asbach:/usr/src/#svnlite diff
Index: sys/amd64/amd64/sigtramp.S
===================================================================
--- sys/amd64/amd64/sigtramp.S  (revision 319101)
+++ sys/amd64/amd64/sigtramp.S  (working copy)
@@ -47,6 +47,13 @@
 0:     hlt                             /* trap priviliged instruction */
        jmp     0b
 
+       /*
+       * Work around a Ryzen bug (say whut?).  There appears to be an
+       * issue with the kernel iretq'ing to a %rip near the end of the
+       * user address space (top of stack).
+       */
+       .space  1088
+
        ALIGN_TEXT
 esigcode:
--------------------------------------------------------------------------------

compiled the kernel, rebooted and started a poudriere bulk build. Crashed after
3h34min with:

    panic: vm_fault: fault on nofault entry, addr: fffffe00131fb000
Comment 4 Andriy Gapon freebsd_committer freebsd_triage 2017-05-29 12:35:42 UTC
Regarding the hardware vs software problem.  I can tell from my personal experience that synthetic tests do not cover everything that FreeBSD can exercise.  So, I wouldn't take it for granted that the hardware is not at fault.

You also need to provide backtraces from your crashes.
They should be printed on the screen when the system crashes.
Also, please try to use kgdb that comes with the gdb port/package.
And use it on /boot/kernel/kernel.
Comment 5 Nils Beyer 2017-05-29 12:48:50 UTC
1) what kind of stress test from a bootable CD/USB stick do you suggest?

2) kgdb from ports wasn't successful, either:
--------------------------------------------------------------------------------
root@asbach:/var/crash/#kgdb7121 /boot/kernel/kernel /var/crash/vmcore.3
GNU gdb (GDB) 7.12.1 [GDB v7.12.1 for FreeBSD]
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd11.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...done.
done.
Reading symbols from /boot/kernel/kernel...Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...done.
done.
Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...done.
done.
Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /usr/lib/debug//boot/kernel/opensolaris.ko.debug...done.
done.
Reading symbols from /boot/kernel/vmm.ko...Reading symbols from /usr/lib/debug//boot/kernel/vmm.ko.debug...done.
done.
Reading symbols from /boot/kernel/ums.ko...Reading symbols from /usr/lib/debug//boot/kernel/ums.ko.debug...done.
done.
Reading symbols from /boot/kernel/pflog.ko...Reading symbols from /usr/lib/debug//boot/kernel/pflog.ko.debug...done.
done.
Reading symbols from /boot/kernel/pf.ko...Reading symbols from /usr/lib/debug//boot/kernel/pf.ko.debug...done.
done.
Reading symbols from /boot/kernel/linux.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux.ko.debug...done.
done.
Reading symbols from /boot/kernel/linux_common.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux_common.ko.debug...done.
done.
Reading symbols from /boot/kernel/linux64.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux64.ko.debug...done.
done.
Reading symbols from /boot/modules/nvidia.ko...(no debugging symbols found)...done.
Reading symbols from /boot/kernel/nullfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/nullfs.ko.debug...done.
done.
Reading symbols from /boot/kernel/linprocfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/linprocfs.ko.debug...done.
done.
Reading symbols from /boot/kernel/tmpfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/tmpfs.ko.debug...done.
done.
Reading symbols from /boot/kernel/fdescfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/fdescfs.ko.debug...done.
done.
No thread selected.
(kgdb) info threads
No threads.
--------------------------------------------------------------------------------

Maybe my minidumps are corrupted, so I've disabled minidumps via:

    sysctl debug.minidump=0

and have started a poudriere bulk build again...
Comment 6 Nils Beyer 2017-05-29 12:59:29 UTC
Providing backtraces from the screen while crashing is a little bit complicated right now because I'm using KDE all the time; and when the crash happens the screen garbles a little bit, and after a while my system reboots - please give me some more time to build a DDB-enabled kernel and switch to the VT console for a poudriere session during night...
Comment 7 Nils Beyer 2017-05-30 07:03:57 UTC
Created attachment 183054 [details]
screenshot of crash with backtrace
Comment 8 Nils Beyer 2017-05-30 07:08:26 UTC
okay, got a screen capture of the crash. It seems related to the Nvidia driver. So I removed the Nvidia card and inserted an AMD Radeon card instead.

The non-mini crash dump is usable neither with "kgdb" nor "kgdb7121" - both complain about "cannot read KPML4phys".

At the moment I'm running a poudriere bulk build again...
Comment 9 Andriy Gapon freebsd_committer freebsd_triage 2017-05-30 08:58:52 UTC
Nils,

I am not aware of any "mini test" that would be as good as a poudriere build or a highly parallel buildworld.

I am not sure that the nvidia is really a culprit here, the stack frames beyond the trap frame look rather messed up.  Maybe it's not that surprising given that the original trap type is 0xc, #SS, Stack-Segment Fault.  Perhaps there was a stack overflow.

If in kgdb you do 'list *0xffffffff831e410b', does it also point to _nv000224rm ?

The inability of kgdb to deal with the memory dump suggests a possibility of critical page table structures being corrupt or a significant binary mismatch between the userland (libkvm, etc) and the kernel.

FWIW, I found a bug report with somewhat similar backtrace, but it does not have any enlightening details: bug #193622.

Perhaps a mismatch between the module and the kernel...
Comment 10 Nils Beyer 2017-05-31 07:52:45 UTC
After I've exchanged the graphics card with an AMD one and started the bulk build yesterday, I switched to VT console 1 and let it run. When I came back this morning the system was already crashed.

But this time the monitor had no signal, meaning I got no visual crash dump. And blindly typing "dump" on the keyboard didn't do anything (no HDD activity afterwards); so that the last thing I could do was to push the reset button.

It was a very hard crash - different from the ones with a Nvidia card.

I think that I need to connect a serial cable to another computer, correct? But I need to use KDE at the same time. Is that even possible - activating a serial console while using KDE at the same time, so that as soon as a crash occurs DDB spawns on the serial connection?
Comment 11 Nils Beyer 2017-06-29 20:05:06 UTC
Well, serial console is a little complicated due to the lack of any serial port. A USB-to-serial adapter didn't work as I couldn't get a remote connection working.

Anyways; crashes are still there. They range from sudden black screen (fans are still running, any music that played before just stopped) to hard system resets.

Latest beta BIOS (with AGESA 1.0.0.6) is applied. Tried several BIOS settings regarding LLC, downclocking, disabling SMT, increasing SoC or CPU voltage. No improvement.

Yesterday, I've gotten a new mainboard (ASRock AB350 Pro4). And it crashes there too with black screens. Disabling "OPCache" in the BIOS (as suggested in the AMD forum's thread) doesn't help.

I'm really out of ideas here...
Comment 12 Nils Beyer 2017-06-29 20:15:31 UTC
If anyone is interested, here's the URL to the AMD forum's thread I'm referring to:

    https://community.amd.com/thread/215773
Comment 13 Mark Millard 2017-06-29 21:49:30 UTC
(In reply to Nils Beyer from comment #12)

As an odd data point: I use a Ryzen 1800X
system running FreeBSD head in VirtualBox
under Windows 10. I use it for buildworld
buildkernel for itself and for cross builds
to powerpc64, powerpc, aarch64, and armv6/7.
Also for building amd64 ports for itself.

It even happens to be a X370 Gaming Pro
Carbon (MS-7A32) context. But it has
lots of slower RAM instead of the more
typical less-but-faster RAM. It also
has very good processor cooling as I
understand.

I tend to use -j16 (so all the threads)
for buildworld buildkernel and can keep
the threads busy doing builds for all
the TARGET_ARCH's.

In this context I've not seen any such
problems other than when I first
started: The machine had been
"optimized" to the point of occasional
unreliability initially. Backing off
the memory handling configuration from
what it was when I was given access to
the machine fixed that issue. (Later
the initial RAM was replaced with even
more RAM.)

And, yes, FreeBSD buildworld buildkernel
seemed to push some issue harder than
other testing techniques had. (Not that
I have a detailed fault analysis.)

Note: I've not been using poudriere to
do any of the builds.

Not a great analogy to your context but
the above has some points in common
with your context.
Comment 14 SF 2017-06-30 08:01:22 UTC
I got the same problem:

CPU and Memory-Benchmarks run completely fine but the system crashes randomly under heavy loads.

I tried using different thermal grease, different coolers, all kinds of different settings in the BIOS and new BIOS versions. None of this has changed the behaviour so far. I often notice some kind of sound loop and the screen turning black, with no error message.

This should be what the opener describes.
Comment 15 SF 2017-06-30 08:09:12 UTC
Specs:
FreeBSD 11.0
XFCE
R7 1700x
Asus PRIME B350-PLUS
Ram G.Skill F4-3600C16D-16GTZR
Comment 16 Nils Beyer 2017-06-30 11:07:17 UTC
Sound loops are somewhat new to me - my music that is playing while using the computer just stops - no sound artifacts at all.

Something else I've noticed are frequent messages in the poudriere logs like:

    error: unable to rename temporary '...' to output file '...': 'No such file or directory'

I'm using tmpfs for the work directories. So I've reconstructed that for a buildworld/buildkernel stress test:
--------------------------------------------------------------------------------------------
#!/bin/sh

trap "exit 1" 1 2 3

cd /usr/src
umount /usr/obj
mount -t tmpfs tmpfs /usr/obj

while [ 1 ]; do
        echo "`date` begin"
        BEG="`date +%s`"
        make -j20 buildworld buildkernel >/usr/obj/${BEG}.log 2>&1
        ERR="$?"
        echo "`date` end - errorcode ${ERR}"
        [ "${ERR}" != "0" ] && cp /usr/obj/${BEG}.log /usr/src/.
        rm /usr/obj/${BEG}.log
done
--------------------------------------------------------------------------------------------


And yes, I get these "unable to rename..." messages as well:
--------------------------------------------------------------------------------------------
1498777486.log:error: unable to rename temporary 'ofb_enc.pico-91eb399a' to output file 'ofb_enc.pico': 'No such file or directory'
1498797284.log:error: unable to rename temporary 'CodeGen/MachineModuleInfo.o-f77c51a1' to output file 'CodeGen/MachineModuleInfo.o': 'No such file or directory'
1498806242.log:error: unable to rename temporary 'dsl_deleg.o-68764d23' to output file 'dsl_deleg.o': 'No such file or directory'
1498807306.log:error: unable to rename temporary 'StaticAnalyzer/Frontend/ModelInjector.o-a246df8d' to output file 'StaticAnalyzer/Frontend/ModelInjector.o': 'No such file or directory'
--------------------------------------------------------------------------------------------

Maybe that is a good and quick indicator for something being wrong. Perhaps somebody can try that, too, please?
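
For anyone who wants to check a directory of build logs for this indicator, a minimal sketch follows. The helper names `list_rename_errors` and `count_affected_logs` are hypothetical and not part of the stress test script; the only thing taken from this thread is the message text "unable to rename temporary".

```sh
#!/bin/sh
# Hypothetical helper: print every "unable to rename temporary" line found in
# the *.log files of a directory (default: the current directory).
list_rename_errors() {
    logdir="${1:-.}"
    # -H prefixes each match with its file name; the extra /dev/null operand
    # keeps grep from reading stdin when no *.log file exists.
    grep -H "unable to rename temporary" "${logdir}"/*.log /dev/null 2>/dev/null
}

# Hypothetical helper: print how many log files contain the message.
count_affected_logs() {
    logdir="${1:-.}"
    grep -l "unable to rename temporary" "${logdir}"/*.log /dev/null 2>/dev/null \
        | wc -l | tr -d ' '
}
```

Running `list_rename_errors /usr/src` after the loop has copied failed logs there should give a listing similar to the one above, assuming the logs were saved with a `.log` suffix.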


As a side note, I've tried to set the RAM timings to something even, as I'm using a 2133 MHz RAM; the default timings are CL15, and in the AMD forum's thread here:

    https://community.amd.com/message/2805287

it is said that Ryzen has problems with odd timings. Unfortunately, no improvements raising CL to 16...
Comment 17 SF 2017-06-30 15:45:04 UTC
Try leaving all settings at Auto except for the first 5 values.

My sound loops for a few milliseconds right before the screen turns black and nothing happens anymore; the computer still idles and seems to run, but I cannot get any error messages. It never happened while compiling.

So far I found out my system runs stable far longer if I set the power settings like line calibration and frequency to regular/optimized and 350hz (higher is better). I also activated EPU and reduced the CPU voltage offset by 0.012. My system runs much longer with this, which led me to try different cooling solutions, but that didn't solve it. Only changing the power settings improved it.
Comment 18 Nils Beyer 2017-06-30 16:19:15 UTC
SF,

sorry, but what five values do you mean exactly? I'm currently not at work so I cannot test anything for the rest of the weekend.

But, because you are experiencing improvements by playing around with voltages, it's probably more a voltage than a timing problem.

One thing that puzzles me most is that the CPU stays very, very cool at 54°C while compiling stuff with 14 threads. We're talking about an octa-core CPU with 3.0 GHz. Even OCing to 3.4GHz, the temperature stays at 58°C.

For my taste this temperature is too low, and this suggests that the Ryzen CPU is permanently undervolted.

If you feel bold, you can try the following:

    - reset all settings to the default values in your BIOS (also SoC voltage)
    - set the CPU VCore voltage statically to 1.412V

and see if that helps in your situation. This has been suggested here:

    https://community.amd.com/message/2805287?tstart=0#2805287
    (last sentence)

Because I'm home, I cannot test that at the moment.

I know that this has not much to do with FreeBSD at all, but for some reason this behaviour is triggered using it. We have another Ryzen-based server running Win2008R2, and it doesn't show that behaviour at all...


BTW: my poudriere build system performed a hard reset (instead of a black screen crash) recently after 8.5 hours - luckily; because so I can still access it now...
Comment 19 SF 2017-06-30 17:01:30 UTC
For me it is the complete opposite: my system runs too hot. Up to 70°C or higher under full load.

Changing voltages to lower values and power settings to what I described improved it for me; my system stays running far longer before crashing. By far longer I mean it runs several hours under heavy load, while with default settings it runs less than an hour before crashing. We are using different mainboards, which means they have different solutions for delivering power to the CPU, but in general I have read from several people that they improved their crashing problems by changing power settings.

You can try setting your RAM manually to what is inside XMP, but this didn't make any difference for me, and I tried all kinds of settings, even custom ones.

All you can and need to do is change the values tCL, tRCD, tRP, tRAS. These are four, not five. In my experience, only setting these values makes a difference. Put command rate to 2T; 1T always causes crashes for me. 2T in general is known to be much more stable than 1T, with minor performance impact (2-3%).

You might also check your RAM voltage, in case you forgot this. Increasing it helps for higher performance; decreasing it gets you more stability.

Don't place more than 2 RAM modules into your system, and ensure they are separated from each other. This will increase stability and performance. Single-rank RAM is also more stable at higher frequencies and performs better than dual-rank, but dual-rank is more cost-efficient. Dual-rank will outperform single-rank RAM if they are at equal frequencies.

https://www.amd.com/system/files/2017-06/am4-motherboard-memory-support-list-en.pdf

You can use faster RAM at your own risk; I know some people have already reached 3600Mhz with my RAM. I can only achieve 3333Mhz.
Comment 20 SF 2017-06-30 18:03:28 UTC
Regarding your low-temperature problem: did you check whether your CPU definitely runs at full speed? I turned off powerd and manually set a fixed frequency in the BIOS because it always stayed at low frequencies.

You might need to do this.
Comment 21 SF 2017-06-30 19:48:01 UTC
I think I solved it. It is related to RAM, CPU power supply and heat.

You must ensure your RAM settings aren't too fast; that causes your CPU to consume more power and produces high peaks that drain the capacitors. Your power supply might not keep up with this, and high RAM performance also causes more heat in your CPU.

Ensure your CPU gets enough cooling.

Adjust the power supply of your mainboard to ensure your CPU doesn't run out of power at high loads or peaks, and you must ensure it doesn't overheat.

I somehow found a balance now between all those settings; I get "good" performance and no crashes so far. By "good" performance I mean something a bit above what Ryzen is normally made for.
Comment 22 SF 2017-06-30 21:41:31 UTC
Disappointing: after a lot of crashes it ran stable for a long while. Now it keeps crashing over and over again.

Back to the scratchboard...
Comment 23 SF 2017-07-02 23:43:03 UTC
After installing watercooling today I noticed I can change my ramspeed from 2800Mhz to 3200Mhz and remain stable.

I bought new RAM and installed it on Friday, switching from G.Skill F4-3600C16D-16GTZR to G.Skill F4-3600C17D-32GTZR. The new RAM did only run with 2800Mhz then with 3333Mhz; now it runs with 3200Mhz and is more stable than the old RAM was.

Not a single crash today; on Friday it crashed at 3200Mhz 5 minutes after booting into the desktop.

It is what I said:

It is related to ram, cpu-power-supply and heat. A hardware-problem.
Comment 24 SF 2017-07-02 23:47:37 UTC
I installed watercooling yesterday (not today) and switched from 2800Mhz ramspeed to 3200Mhz today; after my last, disappointed post I noticed the system was running much more stable on average even without watercooling.
Comment 25 Nils Beyer 2017-07-03 07:23:03 UTC
70°C is not that hot for full load IMHO - but a water cooling system is always nice. ;-)

If you think that your system runs stable now, could you do me a favor and run my stress test, please? Save the content between the lines in my comment 16 to a new file "/usr/src/stress_test.sh" and then

    cd /usr/src
    /usr/bin/nohup sh stress_test.sh

and let it run. From time to time look into "/usr/src/nohup.out" and search for lines where the errorcode is not zero (0). The corresponding log file will be copied to "/usr/src". I bet that every error is that "unable to rename temporary" error I get with poudriere compilation orgies.

Thanks in advance...
Comment 26 Nils Beyer 2017-07-03 07:55:03 UTC
Sorry, forgot to mention that you need to add an ampersand to the "nohup" line, so it has to look like this:

    /usr/bin/nohup sh stress_test.sh &
Comment 27 SF 2017-07-03 13:40:39 UTC
75°C is the threshold for the CPU overheating and throttling itself down.
60°C is the threshold for the X-processors to disable the 4Ghz+ boost; they need to stay under 60°C. Anything above is not good.

I did look at your test, but it needs me to run it as root, and I don't know what it is and what exactly it is doing. I don't want to harm my system or risk security; could you give me something different to test?

Still crashfree since yesterday, btw.
Comment 28 SF 2017-07-03 14:13:27 UTC
basically you want me to do a buildworld and buildkernel with record of errormessages?
Comment 29 Nils Beyer 2017-07-03 14:16:22 UTC
Hmm, are you probably a victim of the AMD Ryzen 20°C temperature offset:

    https://www.reddit.com/r/Amd/comments/607xg2/why_do_the_x_ryzen_cpus_have_a_temperature_offset/

? So your real temperatures were around 5X°C? BTW: how do you measure the CPU temperature under FreeBSD?


Regarding my script - you are correct; it's an endless buildworld and buildkernel loop. Unfortunately, to catch the AMD Ryzen "bug", you need a TMPFS mount. You can allow a user to mount TMPFS by executing the following as root once:

    sysctl vfs.usermount=1

source: https://www.freebsd.org/doc/handbook/usb-disks.html


Here's the modified script that is able to run as user:
--------------------------------------------- SNIP ---------------------------------------------------
#!/bin/sh

OBJDIR="/tmp/ryzen_stress_test"

trap "exit 1" 1 2 3

cd /usr/src
umount ${OBJDIR}
mkdir ${OBJDIR}
mount -t tmpfs tmpfs ${OBJDIR} || exit 1

while [ 1 ]; do
        echo "`date` begin"
        BEG="`date +%s`"
        make -j20 buildworld buildkernel MAKEOBJDIRPREFIX=${OBJDIR} >${OBJDIR}/${BEG}.log 2>&1
        ERR="$?"
        echo "`date` end - errorcode ${ERR}"
        [ "${ERR}" != "0" ] && cp ${OBJDIR}/${BEG}.log ~/.
        rm ${OBJDIR}/${BEG}.log
done
--------------------------------------------- SNIP ---------------------------------------------------

In this version the suspicious log files will be saved to your home dir. Just save everything between the "SNIP"-lines to "ryzen_stress_test.sh" in your home dir and then execute as user:

    /usr/bin/nohup sh ~/ryzen_stress_test.sh &

This spawns the stress test in the background; to kill it, execute as user:

    pkill -f ryzen_stress_test

then the remaining buildworld, buildkernel processes should die afterwards. As user, execute:

    umount /tmp/ryzen_stress_test

in order to unmount and erase the TMPFS buildworld/buildkernel temporary folder...
Comment 30 SF 2017-07-03 14:21:11 UTC
mkdir: /tmp/ryzen_stress_test: File exists
Mon Jul  3 16:17:30 CEST 2017 begin

Does it run?

My BIOS is AGESA 1.0.0.6a, the newest beta-bios. It's not this kind of bug.
Comment 31 SF 2017-07-03 14:24:38 UTC
Ok, 5MB file and i get no error-messages.
Comment 32 Nils Beyer 2017-07-03 14:31:36 UTC
Looks good; if you check via the command "top", you should see CPU usage. And please check with the command "mount" whether the TMPFS is really mounted to "/tmp/ryzen_stress_test"; the output should look something like this:
------------------------------------------------------------------------
zroot on / (zfs, local, noatime, nfsv4acls)
devfs on /dev (devfs, local, multilabel)
tmpfs on /tmp/buildworld (tmpfs, local, nosuid, mounted by nbe)
------------------------------------------------------------------------

Last line with "tmpfs" is important.

If the bug is triggered, you will get ".log" files in your home dir. You can look into "nohup.out" to check the errorcodes of each loop. Errorcode zero (0) means all okay, errorcode two (2) probably means Ryzen "bug", like here:
------------------------------------------------------------------------
umount: /tmp/ryzen_stress_test: not a file system root directory
mkdir: /tmp/ryzen_stress_test: File exists
Mon Jul  3 16:10:32 CEST 2017 begin
Mon Jul  3 16:23:53 CEST 2017 end - errorcode 2
------------------------------------------------------------------------

5MB log file means that the buildworld/buildkernel is not finished yet. Let it run; it can take several hours to trigger that error. After 48h, if all your loops end with zero errorcode, then your system really seems stable now...
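
Checking "nohup.out" by eye gets tedious over 48h, so here is a minimal sketch that filters it. It assumes only the "end - errorcode N" line format that the script above prints; the helper names `failed_loops` and `loop_summary` are hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: print the "end" lines of all loops that did not
# finish with errorcode 0. Reads ./nohup.out unless a file is given.
failed_loops() {
    awk '/ end - errorcode / && $NF != 0 { print }' "${1:-nohup.out}"
}

# Hypothetical helper: summarise how many loops ended and how many failed.
loop_summary() {
    awk '/ end - errorcode / { total++; if ($NF != 0) failed++ }
         END { printf "%d loops, %d failed\n", total, failed }' "${1:-nohup.out}"
}
```

An empty result from `failed_loops` after 48h would correspond to the "all loops end with zero errorcode" case described above.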
Comment 33 SF 2017-07-03 15:59:12 UTC
I did replace the 14kw thermal grease with a 6kw thermal-pad, the watercooling i use is only made for am3 so it doesnt get enough contact with grease. Having problems with it...

After booting I noticed the machine is now crashing over and over again shortly after boot, same as it was on Friday.

BIOS shows 70°C+ temperatures after crash, i switched back to air-cooling because i am out of thermal-grease and cannot use the watercooling anymore. First thing happened, i needed to switch back to 2800Mhz ram speed because it kept crashing again. It also did crash exactly the same way with watercooling and thermal-pad at 3200Mhz.

Do you know what? I think Ryzen just has a heating-problem.

High ramspeed causes more heat, power-supply affects heatingproblems and the switching from aircooling to watercooling making it stable explains everything.

I cannot test it now, i dont have money for putting the watercooling back.
Comment 34 Nils Beyer 2017-07-03 16:15:40 UTC
I'm sorry to hear that your water-cooling doesn't work anymore. I don't think that the Ryzen per-se has temperature problems, more likely your specific model; perhaps bad silicon or something.

I don't have the X-variant, only the non-X-variant with the boxed cooler. And this combi seems to stay cool; the 54°C as shown under FreeBSD is shown in BIOS, too, when I let it sit there for a while. I assume that while being in BIOS the CPU is fully loaded.

Or the temperature measurement is lying. Then we're screwed anyways.

What air CPU cooler do you use at the moment?


@Mark Millard: are you still with us here? Do you have the possibility and time to run the stress test script for me, please? You need at least 8GB of RAM...



Thanks,
Nils
Comment 35 Nils Beyer 2017-07-03 16:22:23 UTC
Created attachment 184037 [details]
Ryzen stress test using buildworld/buildkernel as endless loop

needs to be run as root; look for "errorcode 2" in "nohup.out" and "unable to rename..." in specific log files...
Comment 36 Nils Beyer 2017-07-03 16:26:57 UTC
Dang it - my script needs to run as root because as user every build is now failing...
Comment 37 SF 2017-07-03 17:26:57 UTC
I have 2 different coolers, Zalman CNPS9500A LED and Scythe Kabuto 3. The Zalman is much better; I also notice stability differences between using the Zalman and the Scythe. The Zalman, not surprisingly, runs longer without crashing than the Scythe. The Scythe is just an ordinary top-blower.

If it is of interest to you, in Bios the temperature with Zalman is at 48°C or slightly lower. And no the cpu doesn't run at its full heating in bios, you can add 10°C to 20°C depending on your cooling if your system is under load. Even one single core being under load can cause 10°C+ heat.

I am hardware-developer, filling registers of processors is what i do. You can believe me if i tell you ryzen is having temperature-problems.

Look at this link:
https://community.amd.com/thread/215773?start=30&tstart=0

Someone is saying he has 2 identical systems, one with watercooling and one with aircooling. The system with aircooling is crashing. I did also read from other people saying ryzen is having problems with heating, besides the 20°C offset.
Comment 38 SF 2017-07-03 19:45:55 UTC
Just look at this:
https://www.overclock3d.net/reviews/cases_cooling/amd_ryzen_5_7_cpu_cooler_round_up/6

I didn't read any cooling tests before because there weren't any.

70°C is the upper limit for Ryzen.
Comment 39 Nils Beyer 2017-07-04 09:17:21 UTC
Created attachment 184051 [details]
Ryzen stress test script - now working and working as user; needed to set MAKEOBJDIRPREFIX in environment not as parameter...
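
The attachment itself is not reproduced in this thread. As a minimal sketch of the change its description names (MAKEOBJDIRPREFIX set in the environment instead of passed as a make parameter), the build call could be wrapped as shown here. The wrapper name `with_objdir` is hypothetical, and this is not the content of attachment 184051.

```sh
#!/bin/sh
# Sketch only; not the content of attachment 184051.
OBJDIR="${OBJDIR:-/tmp/ryzen_stress_test}"

# Hypothetical wrapper: run the given command with MAKEOBJDIRPREFIX placed in
# its environment rather than passed as a make command-line variable.
with_objdir() {
    env MAKEOBJDIRPREFIX="${OBJDIR}" "$@"
}

# Intended use inside the loop (not executed here):
#   with_objdir make -j20 buildworld buildkernel >"${OBJDIR}/${BEG}.log" 2>&1
```

Using env(1) keeps the variable out of the calling shell, so only the make invocation sees it.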
Comment 40 Mark Millard 2017-07-04 09:46:46 UTC
(In reply to Nils Beyer from comment #39)

If the builds use META_MODE then the builds
will avoid rebuilding much of what has not
changed via prior activity. You may need
to clean out ${OBJDIR}'s directory tree
before or after a build --or unmount and
then mount a new tmpfs instead.
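
A minimal sketch of the first option (emptying the object tree between builds while leaving the mount in place) is given here. The helper name `clean_objdir` is hypothetical; it would have to be called before the log file for the next run is created, since the script keeps that log inside the same directory.

```sh
#!/bin/sh
# Hypothetical helper: empty an object directory but keep the directory
# itself (and therefore any tmpfs mounted on it) in place.
clean_objdir() {
    dir="${1:-}"
    # Refuse to run on an empty argument, on "/", or on a non-directory.
    [ -n "${dir}" ] && [ "${dir}" != "/" ] && [ -d "${dir}" ] || return 1
    # -mindepth 1 spares the directory itself; -delete removes depth-first.
    find "${dir}" -mindepth 1 -delete
}
```

Unmounting and mounting a fresh tmpfs, the second option, has the same effect but needs the mount privileges discussed earlier in the thread.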

The test may need to specify /etc/src.conf
content or other such context if
reproducibility of test conditions is a goal.

I also wonder what happens with the test
on a non-Ryzen system.


Separately. . .

I'm afraid the Ryzen that I use will not be
available for this kind of activity any time
soon. In fact I may lose access to it for a
time.

And I'm not set up for a native FreeBSD boot:
The context is Windows 10 running VirtualBox,
which in turn is running FreeBSD.
Comment 41 Nils Beyer 2017-07-04 09:51:14 UTC
I think you mean this post here:

    https://community.amd.com/message/2802591#comment-2802591

Well, maybe the hysteresis temperature of Ryzen is not 75°C but rather 50°C, meaning as soon as the CPU reaches (IMHO rather low) 50°C it starts becoming susceptible to data corruption. That would explain the behaviour here on my system.

One guy here "tosilva" suspects that there is a signal integrity issue:

    https://community.amd.com/message/2808194#comment-2808194

that could possibly be fixed by raising voltages; which voltages exactly he doesn't say - so I assume every voltage. And by what amount - no idea.

That's what I'm trying right now, raising every voltage by 100mV:

    CPU Voltage: 1.18750 -> 1.28750
    VPPM: 2.550 -> 2.650
    2.50V Voltage: 2.520 -> 2.620
    DRAM Voltage: 1.210 -> 1.310
    +1.8 Voltage: 1.820 -> 1.920
    VDDP: 0.950 -> 1.050
    1.05V Voltage: 1.072 -> 1.072 (can only be raised several mV, so I left it AUTO)
Comment 42 Nils Beyer 2017-07-04 09:55:05 UTC
(In reply to Mark Millard from comment #40)

> I also wonder what happens with the test on a non-Ryzen system.

tried that - Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz - here's the result so far:
----------------------------------------------------------------------------
#cat nohup.out 
umount: /usr/obj: not a file system root directory
Fri Jun 30 09:51:46 CEST 2017 begin
Fri Jun 30 10:38:47 CEST 2017 end - errorcode 0
Fri Jun 30 10:38:47 CEST 2017 begin
Fri Jun 30 11:25:44 CEST 2017 end - errorcode 0
Fri Jun 30 11:25:44 CEST 2017 begin
Fri Jun 30 12:12:43 CEST 2017 end - errorcode 0
Fri Jun 30 12:12:43 CEST 2017 begin
Fri Jun 30 12:59:34 CEST 2017 end - errorcode 0
Fri Jun 30 12:59:34 CEST 2017 begin
Fri Jun 30 13:46:29 CEST 2017 end - errorcode 0
Fri Jun 30 13:46:29 CEST 2017 begin
Fri Jun 30 14:33:26 CEST 2017 end - errorcode 0
Fri Jun 30 14:33:26 CEST 2017 begin
Fri Jun 30 15:20:24 CEST 2017 end - errorcode 0
Fri Jun 30 15:20:24 CEST 2017 begin
Fri Jun 30 16:07:16 CEST 2017 end - errorcode 0
Fri Jun 30 16:07:16 CEST 2017 begin
Fri Jun 30 16:54:14 CEST 2017 end - errorcode 0
Fri Jun 30 16:54:14 CEST 2017 begin
Fri Jun 30 17:41:08 CEST 2017 end - errorcode 0
Fri Jun 30 17:41:08 CEST 2017 begin
Fri Jun 30 18:28:09 CEST 2017 end - errorcode 0
Fri Jun 30 18:28:09 CEST 2017 begin
Fri Jun 30 19:15:07 CEST 2017 end - errorcode 0
Fri Jun 30 19:15:07 CEST 2017 begin
Fri Jun 30 20:02:06 CEST 2017 end - errorcode 0
Fri Jun 30 20:02:06 CEST 2017 begin
Fri Jun 30 20:49:03 CEST 2017 end - errorcode 0
Fri Jun 30 20:49:03 CEST 2017 begin
Fri Jun 30 21:35:58 CEST 2017 end - errorcode 0
Fri Jun 30 21:35:58 CEST 2017 begin
Fri Jun 30 22:22:56 CEST 2017 end - errorcode 0
Fri Jun 30 22:22:56 CEST 2017 begin
Fri Jun 30 23:09:52 CEST 2017 end - errorcode 0
Fri Jun 30 23:09:52 CEST 2017 begin
Fri Jun 30 23:56:48 CEST 2017 end - errorcode 0
Fri Jun 30 23:56:48 CEST 2017 begin
Sat Jul  1 00:43:44 CEST 2017 end - errorcode 0
Sat Jul  1 00:43:44 CEST 2017 begin
Sat Jul  1 01:30:40 CEST 2017 end - errorcode 0
Sat Jul  1 01:30:40 CEST 2017 begin
Sat Jul  1 02:17:37 CEST 2017 end - errorcode 0
Sat Jul  1 02:17:37 CEST 2017 begin
Sat Jul  1 03:04:37 CEST 2017 end - errorcode 0
Sat Jul  1 03:04:37 CEST 2017 begin
Sat Jul  1 03:51:39 CEST 2017 end - errorcode 0
Sat Jul  1 03:51:39 CEST 2017 begin
Sat Jul  1 04:38:38 CEST 2017 end - errorcode 0
Sat Jul  1 04:38:38 CEST 2017 begin
Sat Jul  1 05:25:35 CEST 2017 end - errorcode 0
Sat Jul  1 05:25:35 CEST 2017 begin
Sat Jul  1 06:12:28 CEST 2017 end - errorcode 0
Sat Jul  1 06:12:28 CEST 2017 begin
Sat Jul  1 06:59:25 CEST 2017 end - errorcode 0
Sat Jul  1 06:59:25 CEST 2017 begin
Sat Jul  1 07:46:19 CEST 2017 end - errorcode 0
Sat Jul  1 07:46:19 CEST 2017 begin
Sat Jul  1 08:33:11 CEST 2017 end - errorcode 0
Sat Jul  1 08:33:11 CEST 2017 begin
Sat Jul  1 09:20:07 CEST 2017 end - errorcode 0
Sat Jul  1 09:20:07 CEST 2017 begin
Sat Jul  1 10:07:04 CEST 2017 end - errorcode 0
Sat Jul  1 10:07:04 CEST 2017 begin
Sat Jul  1 10:53:56 CEST 2017 end - errorcode 0
Sat Jul  1 10:53:56 CEST 2017 begin
Sat Jul  1 11:40:49 CEST 2017 end - errorcode 0
Sat Jul  1 11:40:49 CEST 2017 begin
Sat Jul  1 12:27:43 CEST 2017 end - errorcode 0
Sat Jul  1 12:27:43 CEST 2017 begin
Sat Jul  1 13:14:39 CEST 2017 end - errorcode 0
Sat Jul  1 13:14:39 CEST 2017 begin
Sat Jul  1 14:01:35 CEST 2017 end - errorcode 0
Sat Jul  1 14:01:35 CEST 2017 begin
Sat Jul  1 14:48:31 CEST 2017 end - errorcode 0
Sat Jul  1 14:48:31 CEST 2017 begin
Sat Jul  1 15:35:29 CEST 2017 end - errorcode 0
Sat Jul  1 15:35:29 CEST 2017 begin
Sat Jul  1 16:22:24 CEST 2017 end - errorcode 0
Sat Jul  1 16:22:24 CEST 2017 begin
Sat Jul  1 17:09:16 CEST 2017 end - errorcode 0
Sat Jul  1 17:09:16 CEST 2017 begin
Sat Jul  1 17:56:12 CEST 2017 end - errorcode 0
Sat Jul  1 17:56:12 CEST 2017 begin
Sat Jul  1 18:43:06 CEST 2017 end - errorcode 0
Sat Jul  1 18:43:06 CEST 2017 begin
Sat Jul  1 19:30:00 CEST 2017 end - errorcode 0
Sat Jul  1 19:30:00 CEST 2017 begin
Sat Jul  1 20:16:55 CEST 2017 end - errorcode 0
Sat Jul  1 20:16:55 CEST 2017 begin
Sat Jul  1 21:03:52 CEST 2017 end - errorcode 0
Sat Jul  1 21:03:52 CEST 2017 begin
Sat Jul  1 21:50:47 CEST 2017 end - errorcode 0
Sat Jul  1 21:50:47 CEST 2017 begin
Sat Jul  1 22:37:41 CEST 2017 end - errorcode 0
Sat Jul  1 22:37:41 CEST 2017 begin
Sat Jul  1 23:24:34 CEST 2017 end - errorcode 0
Sat Jul  1 23:24:34 CEST 2017 begin
Sun Jul  2 00:11:29 CEST 2017 end - errorcode 0
Sun Jul  2 00:11:29 CEST 2017 begin
Sun Jul  2 00:58:26 CEST 2017 end - errorcode 0
Sun Jul  2 00:58:26 CEST 2017 begin
Sun Jul  2 01:45:19 CEST 2017 end - errorcode 0
Sun Jul  2 01:45:19 CEST 2017 begin
Sun Jul  2 02:32:14 CEST 2017 end - errorcode 0
Sun Jul  2 02:32:14 CEST 2017 begin
Sun Jul  2 03:19:15 CEST 2017 end - errorcode 0
Sun Jul  2 03:19:15 CEST 2017 begin
Sun Jul  2 04:06:12 CEST 2017 end - errorcode 0
Sun Jul  2 04:06:12 CEST 2017 begin
Sun Jul  2 04:53:10 CEST 2017 end - errorcode 0
Sun Jul  2 04:53:10 CEST 2017 begin
Sun Jul  2 05:40:07 CEST 2017 end - errorcode 0
Sun Jul  2 05:40:07 CEST 2017 begin
Sun Jul  2 06:27:03 CEST 2017 end - errorcode 0
Sun Jul  2 06:27:03 CEST 2017 begin
Sun Jul  2 07:13:59 CEST 2017 end - errorcode 0
Sun Jul  2 07:13:59 CEST 2017 begin
Sun Jul  2 08:00:57 CEST 2017 end - errorcode 0
Sun Jul  2 08:00:57 CEST 2017 begin
Sun Jul  2 08:47:57 CEST 2017 end - errorcode 0
Sun Jul  2 08:47:57 CEST 2017 begin
Sun Jul  2 09:34:49 CEST 2017 end - errorcode 0
Sun Jul  2 09:34:49 CEST 2017 begin
Sun Jul  2 10:21:43 CEST 2017 end - errorcode 0
Sun Jul  2 10:21:43 CEST 2017 begin
Sun Jul  2 11:08:42 CEST 2017 end - errorcode 0
Sun Jul  2 11:08:42 CEST 2017 begin
Sun Jul  2 11:55:37 CEST 2017 end - errorcode 0
Sun Jul  2 11:55:37 CEST 2017 begin
Sun Jul  2 12:42:34 CEST 2017 end - errorcode 0
Sun Jul  2 12:42:34 CEST 2017 begin
Sun Jul  2 13:29:32 CEST 2017 end - errorcode 0
Sun Jul  2 13:29:32 CEST 2017 begin
Sun Jul  2 14:16:29 CEST 2017 end - errorcode 0
Sun Jul  2 14:16:29 CEST 2017 begin
Sun Jul  2 15:03:26 CEST 2017 end - errorcode 0
Sun Jul  2 15:03:26 CEST 2017 begin
Sun Jul  2 15:50:14 CEST 2017 end - errorcode 0
Sun Jul  2 15:50:14 CEST 2017 begin
Sun Jul  2 16:37:12 CEST 2017 end - errorcode 0
Sun Jul  2 16:37:12 CEST 2017 begin
Sun Jul  2 17:24:08 CEST 2017 end - errorcode 0
Sun Jul  2 17:24:08 CEST 2017 begin
Sun Jul  2 18:11:02 CEST 2017 end - errorcode 0
Sun Jul  2 18:11:02 CEST 2017 begin
Sun Jul  2 18:57:55 CEST 2017 end - errorcode 0
Sun Jul  2 18:57:55 CEST 2017 begin
Sun Jul  2 19:44:49 CEST 2017 end - errorcode 0
Sun Jul  2 19:44:49 CEST 2017 begin
Sun Jul  2 20:31:43 CEST 2017 end - errorcode 0
Sun Jul  2 20:31:43 CEST 2017 begin
Sun Jul  2 21:18:39 CEST 2017 end - errorcode 0
Sun Jul  2 21:18:39 CEST 2017 begin
Sun Jul  2 22:05:35 CEST 2017 end - errorcode 0
Sun Jul  2 22:05:35 CEST 2017 begin
Sun Jul  2 22:52:31 CEST 2017 end - errorcode 0
Sun Jul  2 22:52:31 CEST 2017 begin
Sun Jul  2 23:39:32 CEST 2017 end - errorcode 0
Sun Jul  2 23:39:32 CEST 2017 begin
Mon Jul  3 00:26:29 CEST 2017 end - errorcode 0
Mon Jul  3 00:26:29 CEST 2017 begin
Mon Jul  3 01:13:26 CEST 2017 end - errorcode 0
Mon Jul  3 01:13:26 CEST 2017 begin
Mon Jul  3 02:00:22 CEST 2017 end - errorcode 0
Mon Jul  3 02:00:22 CEST 2017 begin
Mon Jul  3 02:47:20 CEST 2017 end - errorcode 0
Mon Jul  3 02:47:20 CEST 2017 begin
Mon Jul  3 03:34:21 CEST 2017 end - errorcode 0
Mon Jul  3 03:34:21 CEST 2017 begin
Mon Jul  3 04:21:17 CEST 2017 end - errorcode 0
Mon Jul  3 04:21:17 CEST 2017 begin
Mon Jul  3 05:08:15 CEST 2017 end - errorcode 0
Mon Jul  3 05:08:15 CEST 2017 begin
Mon Jul  3 05:55:12 CEST 2017 end - errorcode 0
Mon Jul  3 05:55:12 CEST 2017 begin
Mon Jul  3 06:42:08 CEST 2017 end - errorcode 0
Mon Jul  3 06:42:08 CEST 2017 begin
Mon Jul  3 07:29:09 CEST 2017 end - errorcode 0
Mon Jul  3 07:29:09 CEST 2017 begin
Mon Jul  3 08:16:08 CEST 2017 end - errorcode 0
Mon Jul  3 08:16:08 CEST 2017 begin
Mon Jul  3 09:03:08 CEST 2017 end - errorcode 0
Mon Jul  3 09:03:08 CEST 2017 begin
----------------------------------------------------------------------------
Comment 43 SF 2017-07-04 12:29:16 UTC
No, as I wrote before, though I may not have made it clear enough: my problems started at 70°C with complete crashes, and the site I posted says 70°C is the highest temperature for Ryzen, which confirms my experience. The RAM already starts causing problems at lower temperatures.

If your thermal paste is too weak, heat builds up very high under load and gives you random errors, because your load isn't always the same.

Just add better cooling to the Ryzen, that's it.
Comment 44 Nils Beyer 2017-07-05 15:03:33 UTC
Quick update: after increasing the voltages in BIOS, the poudriere bulk build is still running for about 24h now...
Comment 45 Nils Beyer 2017-07-05 16:44:36 UTC
Look what we've gotten here in "dmesg":
---------------------------------------------------------------------------
MCA: Bank 1, Status 0x90200000000b0151
MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 14
MCA: CPU 14 COR ICACHE L1 IRD error
---------------------------------------------------------------------------
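For reference, the status word in the MCA lines above can be unpacked by hand. What follows is only a minimal sketch in C, assuming nothing beyond the architectural MCi_STATUS layout (VAL in bit 63, OVER in bit 62, UC in bit 61, MCA error code in bits 15:0) and the memory-hierarchy error code format 0000 0001 RRRR TTLL. The names mca_decode and struct mca_info are made up for illustration and are not the kernel's; vendor-specific bits are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural MCi_STATUS flag bits. */
#define MC_STATUS_VAL  (1ULL << 63) /* register holds a valid error */
#define MC_STATUS_OVER (1ULL << 62) /* an earlier error was overwritten */
#define MC_STATUS_UC   (1ULL << 61) /* error was not corrected by hardware */

/* Hypothetical result type, for illustration only. */
struct mca_info {
	int valid;
	int overflow;
	int uncorrected;
	unsigned code;        /* bits 15:0, MCA error code */
	const char *request;  /* RRRR field */
	const char *type;     /* TT field */
	const char *level;    /* LL field */
};

static const char *const mca_requests[] = {
	"ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"
};
static const char *const mca_types[] = { "ICACHE", "DCACHE", "GCACHE", "?" };
static const char *const mca_levels[] = { "L0", "L1", "L2", "LG" };

/*
 * Decode a raw status word into *out.
 * Returns 1 if the error code is a memory-hierarchy code, else 0
 * (in which case request/type/level are left as "?").
 */
int
mca_decode(uint64_t status, struct mca_info *out)
{
	unsigned rrrr;

	assert(out != NULL);
	out->valid = (status & MC_STATUS_VAL) != 0;
	out->overflow = (status & MC_STATUS_OVER) != 0;
	out->uncorrected = (status & MC_STATUS_UC) != 0;
	out->code = (unsigned)(status & 0xffff);
	out->request = out->type = out->level = "?";

	/* Memory-hierarchy codes have the form 0000 0001 RRRR TTLL. */
	if ((out->code & 0xff00) != 0x0100)
		return (0);
	rrrr = (out->code >> 4) & 0xf;
	if (rrrr < sizeof(mca_requests) / sizeof(mca_requests[0]))
		out->request = mca_requests[rrrr];
	out->type = mca_types[(out->code >> 2) & 0x3];
	out->level = mca_levels[out->code & 0x3];
	return (1);
}
```

Fed 0x90200000000b0151, this reports a valid entry with UC clear (hence "COR") and error code 0x0151, i.e. request IRD, type ICACHE, level L1, which is consistent with the last MCA line: a corrected error on an L1 instruction-cache fetch.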

Still running...
Comment 46 Nils Beyer 2017-07-05 21:50:19 UTC
And it hard-locked; back to start...
Comment 47 Mark Millard 2017-07-06 05:11:47 UTC
(In reply to Mark Millard from comment #40)

Any attempt to run the requested test is even more
blocked now:

The Ryzen system became unavailable early. An
attempted BIOS update for the MSI x370 Gaming
Pro Carbon from 7A32v15 to 7A32v17 left the PC
bricked with a debug LED that indicates "CPU not
detected or fail". It does not even begin to POST
as far as can be seen otherwise. There is no
publicly supported way to update the BIOS from
this status.

(I've been applying BIOS updates that report
improved memory compatibility or handling for
now. Other than that subject area I'd have been
happy to just leave the BIOS alone from the
beginning. The PC being an early-adopter PC I
figured on more BIOS updates applied than for
something well established.)
Comment 48 SF 2017-07-06 14:30:48 UTC
Your DRAM voltage is too low; it should be 1.35 V. Leave all other settings at their defaults; changing them all is ridiculous.

Look in your BIOS: there is a menu called SPD information. Open it and use the XMP data to set your RAM manually to those settings. If your PC doesn't boot, adjust the RAM frequency. Nothing else, because nothing else helps.

The power settings I talked about are CPU switching frequency, power phase control and load-line calibration. You don't need to touch the CPU voltage; decreasing it might help.

As I told you before, you will get it more stable but not entirely stable, because your system needs different cooling.

The stuff you think and talk about comes from lots of Intel people who think this works like an Intel CPU, which can run at 90°C or some such crazy temperature. No, it's different with Ryzen: you need good cooling at stock settings and high-end cooling for overclocking.
Comment 49 Nils Beyer 2017-07-06 15:50:00 UTC
Okay, in BIOS, I've loaded the optimized defaults and set

    - restore on AC/power loss -> power on
    - SATA hot plug -> enabled
    - DDR voltage -> 1.3500

and started a poudriere bulk build. Let's see how it goes.

Regarding temperatures: according to the output of "superiotool -de", the CPU temperature is 54°C while working on 15 threads, so I'm very confident that I don't have any temperature problems. The system also stands in an air-conditioned room.

And yes, the Intels can get very hot; look at the new i9-7900x - these temps are inhuman... ;-)
Comment 50 SF 2017-07-06 18:25:55 UTC
I read that B350 boards are shitty, and my board is a B350. I guess that's the reason why cooling, RAM and power settings affect it. The chipset is somehow unstable with all of this.
Comment 51 SF 2017-07-07 11:12:26 UTC
It's not caused by the B350 chipset; it's caused by the power supply of boards with the B350 chipset. I was on the right track: X370 boards with PCI slots also have this weak power supply. It seems only the high-end boards aren't affected.

You can influence this with cooling, RAM and power settings in your BIOS.

I tried this myself and it's exactly what I said previously: only the power settings give you more stability.

Set CPU Switching Frequency to the highest you can. Set CPU Power Phase Control to Full Phase Mode. I still don't understand what CPU/SOC Load Line Calibration does, but setting both to extreme seems to make things worse.

Do you remember me saying the top-blower cooler is worse than the Zalman? Do you know why? The top-blower blows the hot air onto the heat sinks of the power supply. Do you know why water cooling was much more stable? All the heat got transported away from the board, so the heat sinks received much less of it.

With air cooling I can now run my RAM stably at 3200 MHz. This wasn't possible before because I had mistakenly put some weak thermal grease on it, thinking it didn't matter much.
Comment 52 Nils Beyer 2017-07-10 15:31:14 UTC
Increasing the DRAM voltage to 1.35V only helped for about 25 hours; then it hard-locked again. Perhaps I need ECC RAM, I don't know. At the moment, I'm waiting for a new AGESA version from AMD... :-(
Comment 53 SF 2017-07-10 16:05:57 UTC
This is useless; this is caused by the power supply, not by the RAM. Did you even read what I wrote?

The power supply is overheating and doesn't have enough phases, so you need to manually adjust some settings in your BIOS accordingly. You can find several threads on the internet, and what they say exactly matches what I said here in my very first posts. I could only achieve more stability with the power settings; the power settings could only be improved once the cooling was correct, and the RAM settings could be raised after that.
Comment 54 Nils Beyer 2017-07-10 16:22:56 UTC
I did read what you've written. My PSU is a be quiet! "System Power 8 600W", which should have plenty of power reserve; the graphics card is a passively cooled Radeon RX460. Total power draw of the system is less than 200W according to the UPS measurements during compilation orgies.

I don't have the options to set the mainboard VRMs to "full phase", extreme, or any other profile. LLC isn't available either. The behaviour of my new ASRock B350 board is exactly the same as that of the MSI X370 board: hard-locks. So for me, there's _no_ difference between the mainstream and the enthusiast tiers. And no, the MSI X370 board doesn't have any PCI slots.

My case is a Zalman Neo Z9 with a total of five chassis fans. If there is any heat, it will be transported out immediately.

In my eyes it's a design error somewhere; in CPU and/or mainboard logic. Maybe it's even a bug in the OS - but that's something I cannot judge. Either the system is rock-stable or it isn't; there's no between...
Comment 55 SF 2017-07-10 16:32:05 UTC
I'm talking about the power supply on the mainboard; there is a power supply there too. It has several phases and switches between them to supply the CPU with a voltage lower than 12 V. The rate of that switching is the switching frequency, and you need to change its setting in your BIOS. This makes your system more stable. The next thing is load-line calibration. Don't set this to extreme and don't set it to regular. I guess your capacitors run empty on regular and on extreme it overheats, but I don't know because I can't see its temperatures.

The power supply of B350 motherboards has 4+2 phases, as do some of the X370 boards; that's not enough to make the CPU entirely stable. It's running out of power. Besides this, it also overheats like hell.

Buy a motherboard with more phases. You can identify them by looking to the left of your CPU for the black blocks. These are inductors; counting them gives the total number of phases. I know this because I develop such things.
Comment 56 Don Lewis freebsd_committer freebsd_triage 2017-07-13 00:23:06 UTC
I'm seeing similar problems here.
  Ryzen 1700X
  Gigabyte AX370-Gaming 5
  64GB Crucial DDR-2400 ECC RAM (from Gigabyte QVL)
  Seasonic 650W Prime Titanium PSU
  NVIDIA GT21x video card (VESA text mode only for the console, no X11,
  and no nvidia driver).

I originally started with a Gigabyte AB350-Gaming motherboard but swapped it out due to high VRM temps and lack of ECC RAM support.  Both boards ran four passes of memtest86 and overnight prime95 runs w/o error.  The first board had the original BIOS that didn't know about the 20C Tctl offset, so it always ran its fans very fast.  The F5 BIOS that I installed on the X370 board had the AGESA 1004a update and knew about the Tctl offset.  I've been seeing idle temps in the BIOS of about 35C on it.

Both boards either randomly lock up (blank console) or silently reboot (no panic messages in /var/log/messages and no crash dumps), along with some other errors while building ports with poudriere.  I was getting a bunch of the "unable to rename temporary" errors while running an older version of FreeBSD 12.0 (r320316?), but haven't seen that since upgrading to r320570.

Things I've tried since upgrading to the new motherboard:
  Set fan speed at max
  Core performance boost -> off
  SMT -> off
  RAM 2400 MHz -> 1866 MHz
  CPU 3.4GHz -> 3.0GHz
  Disable 4 cores (auto -> 2+2) - Seems to run 2x longer at half throughput
    before rebooting (which happened very early in the morning when the
    room temp was at its lowest).  Exhaust air was very cool.
  Dragonfly patch

I just upgraded to the latest (F6) BIOS which has AGESA 1006.  I should have some results in a few hours, but at this point, I'm not optimistic.

In the meantime, my AMD FX-8320E is rock solid under the same sort of load, but it just takes too long to build all the package sets that I use.
Comment 57 Mark Millard 2017-07-13 01:01:07 UTC
(In reply to Don Lewis from comment #56)

While I'm not claiming it is a fix to what
you are seeing there is a recent time
frame with a nasty error to avoid:

-r319722 through -r320651 (fixed by -r320652)
(Relevant for non-invariants kernel builds
of head.)

Another issue is the META_MODE error for
rebuilding the kernel:

-r320220 through -r320918 (fixed by -r320919)

Without this fix one should force complete
kernel rebuilds by deleting the relevant
directory tree first --or otherwise validate
that everything was rebuilt that should have
been rebuilt and dealing with anything that
did not get rebuilt.

(-r320570 fixes another error but it happens
that amd64 accidentally worked before the fix
while powerpc and possibly armv6/v7 did not.)

Avoiding these problem version ranges may help
avoid getting extra problems in the mix for the
investigations.
Comment 58 Don Lewis freebsd_committer freebsd_triage 2017-07-13 01:22:28 UTC
(In reply to Mark Millard from comment #57)
I have both INVARIANTS and WITNESS enabled in my kernel, which is costing me some performance.  I also always do full rebuilds.  My FX-8320E machine is using the same source revision and kernel config.  It has been busy building ports for the last 2 1/2 days and has been totally stable.
Comment 59 Mark Millard 2017-07-13 02:14:47 UTC
(In reply to Don Lewis from comment #58)

Good to know that 11.x variants (likely non-debug) and head (12) debug
kernels both get the problems.

Unfortunately the MSI X370 Gaming Pro Carbon 1800X context that I had access
to is unavailable: a BIOS update (from 7A32v15 to 7A32v17) turned it into a
brick. I've no clue if eventual access to another will be as well behaved.

I had no problems but ran FreeBSD (head but non-debug/non-invariants) in a
VirtualBox virtual machine under Windows 10 Pro.

It would be interesting if someone with a known workload that
fails on known hardware could try the workload under a virtual machine
under, say, Windows 10, on the example hardware in order to see if the
problem is repeatable where the OS really in control is a primary one for
the X370 board makers.

If the problem does not occur then the implication would be some
difference in the handling of things makes the difference in the
overall result. (Not that I have a clue what it would then be that
made the difference.)

If the problem instead repeated, that too would be interesting and
might be of more interest to the board makers: crashes under a
primary OS.
Comment 60 Don Lewis freebsd_committer freebsd_triage 2017-07-13 02:47:47 UTC
There have been a lot of reports from Linux users of problems with heavy parallel compilation loads, though the symptoms are different, usually random compiler and bash segfaults.  These problems have been reproduced in Windows Subsystem for Linux (WSL), which is basically an Ubuntu environment that runs under Windows.

So far AGESA 1006 is actually looking promising (knock on wood).  I'm currently running all eight cores and poudriere has now been running a little over four hours. My earlier tests mostly died around three hours.  My test with four cores ran for six.  I'll let this run and if it runs to completion, probably sometime tomorrow morning, I'll start undoing some of the detuning that I've done for the current configuration.
Comment 61 SF 2017-07-13 06:44:13 UTC
I run AGESA 1.0.0.6a, the newest version you can get, and still have problems. Read my posts and try what I did if you are seriously searching for a solution; otherwise reading here is a waste of time.
Comment 62 Don Lewis freebsd_committer freebsd_triage 2017-07-13 09:18:32 UTC
I've had by far the best results so far with the F6 (AGESA 1006) BIOS.  The system
stayed up, though I had some unexpected port build failures.

lang/rust failed with this error, which I've seen a few times before:
   <jemalloc>: jemalloc_arena.c:821: Failed assertion: "nstime_compare(&decay->epoch, &time) <= 0"

From what I've been told, this may be caused by TSC appearing to go backwards, though this needs further investigation.

hdf5 failed for the same reason, as did gstreamer-plugins-vp8.

I'd hoped this problem would go away, but it should be possible to write
some code to narrow this problem down.
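
A minimal sketch of what such a check could look like, under two assumptions that are not confirmed here: that the timestamps jemalloc compares ultimately come from a monotonic clock, and that this clock is derived from the TSC on the machine in question. The helper names now_ns and count_backward_steps are invented for this example. It samples CLOCK_MONOTONIC in a tight loop and counts every sample that is earlier than the previous one. The thread is deliberately not pinned, so the scheduler is free to migrate it between cores.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Read CLOCK_MONOTONIC as a single nanosecond count. */
static int64_t
now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec);
}

/*
 * Take `samples` back-to-back readings and count how often a reading
 * is smaller than the one before it.  The largest backward jump seen
 * (in nanoseconds, 0 if none) is stored in *worst_ns.
 */
long
count_backward_steps(long samples, int64_t *worst_ns)
{
	int64_t prev, cur;
	long backward = 0;

	assert(samples > 0 && worst_ns != NULL);
	*worst_ns = 0;
	prev = now_ns();
	for (long i = 0; i < samples; i++) {
		cur = now_ns();
		if (cur < prev) {
			backward++;
			if (prev - cur > *worst_ns)
				*worst_ns = prev - cur;
		}
		prev = cur;
	}
	return (backward);
}
```

A non-zero return value would point at the clock source rather than at jemalloc. A zero result proves little on its own, since the failure may need a heavy parallel load or a particular pair of cores to show up.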
Comment 63 SF 2017-07-13 10:03:48 UTC
It still fails. I can confirm that my experience also improved with 1.0.0.6a, but you try a lot of dumb stuff. I give you a solution and you don't even try it? Seriously? I ran completely stable since Friday and crashed today, but I doubt this crash has anything to do with the issue we are talking about here, because it is reproducible and always happens at exactly the same action. Normally the error we are talking about here happens randomly.
Comment 64 SF 2017-07-13 10:40:45 UTC
It is the same problem we are talking about here after all; reducing the RAM speed made it stable again. But why did nothing fail for such a long time, and now on cooler days it crashes?
Comment 65 Don Lewis freebsd_committer freebsd_triage 2017-07-13 23:24:42 UTC
(In reply to SF from comment #63)

The motherboard I'm currently using has six Vcore VRM phases.  Basically the top of the line for Gigabyte AM4 boards.  The only difference between this board and the Gigabyte flagship is that this board doesn't have an adjustable bclk.

I basically didn't see any difference between this board and the B350 board that I was initially using.  Both crashed or locked up when doing parallel compiles, but both survived running 16 threads of Prime95 (actually mprime on FreeBSD because I don't have Windows).

This X370 board has problems with SMT off and half the cores disabled, so basically only four parallel threads running.  That should hardly stress the PSU or VRM at all and temperatures should be pretty low.  Even with everything on, the idle temps in the BIOS look good, so I don't think it's a thermal problem.  My last crash was early this morning, when the room temperature was a lot lower than when the machine was running happily last evening.  There are no VRM knobs in the Gigabyte BIOS other than voltage and LLC.  I would think those wouldn't be critical at 1/4 load ...

It doesn't appear to be a RAM timing problem.  Cranking the RAM speed down basically has no effect.   ECC should be working so if a single bit error cropped up, it should get corrected.  Memtest86 was clean, even the rowhammer test.

The crashes seem to be fairly random.  Restarting the ports that were building at the time of a crash is often successful.

The run that I did after upgrading to AGESA 1006 was by far the best.  With all eight cores enabled but SMT still off, poudriere ran for a bit more than 10 hours.  As I previously mentioned three ports failed due to the jemalloc problem, but the machine stayed up.  I restarted poudriere and those ports built as well as a number of ports that depended on them.  The build ran for a few hours, but the machine silently rebooted before poudriere finished.   When I restarted poudriere, all but one of the remaining ports built.  I didn't see any obvious error in the log for the failing port, but it successfully built when I ran poudriere another time.
Comment 66 Don Lewis freebsd_committer freebsd_triage 2017-07-13 23:26:50 UTC
Forgot to mention that I ordered a DB-9 bracket.  I should be able to set up a serial console next week to see if I get any kernel messages before the system hangs or reboots.  Nothing is showing up in the logs and I'm not getting a crashdump.
Comment 67 Nils Beyer 2017-07-14 13:24:00 UTC
Perhaps interesting to you - here is another Ryzen stress test:

    https://github.com/hayamdk/ryzen_segv_test

I had to patch it with:
----------------------------- SNIP -------------------------------------
diff --git a/ryzen_segv_test.c b/ryzen_segv_test.c
index 8d64215..74d8530 100644
--- a/ryzen_segv_test.c
+++ b/ryzen_segv_test.c
@@ -323,11 +323,6 @@ int main(int argc, const char *argv[])
 {
        int64_t loops;
        pthread_t t1, t2, t3;
-#ifdef _MSC_VER
-#else
-       cpu_set_t cpuset;
-       int cpu;
-#endif
        pid_t pid = getpid();
        
        if(argc > 1) {
@@ -352,15 +347,6 @@ int main(int argc, const char *argv[])
        pthread_create(&t2, NULL, (void*)threadx, (void*)1);
        pthread_create(&t3, NULL, (void*)threadx, NULL);
        
-#ifdef _MSC_VER
-#else
-       cpu = random() % n_cpus;
-       CPU_ZERO(&cpuset);
-       CPU_SET(cpu, &cpuset);
-       sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset);
-       fprintf(stderr, "PID:%d CPU:%d\n", (int)pid, cpu);
-#endif
-       
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        pthread_join(t3, NULL);
----------------------------- SNIP -------------------------------------

compiled it and executed:

    ./run.sh 16 2500000

It generates segmentation faults rather quickly (< 1 min).

As a side note: on my Intel E3-1220v3 system there are no segfaults at all (using the same command line)...
Comment 68 SF 2017-07-14 13:26:15 UTC
That's exactly what I have; you describe exactly what's happening to me.

None of these settings help. Try CPU switching frequency and load-line calibration: set LLC somewhere in the middle and the CPU switching frequency to the highest you can, depending on your cooling.

Regarding your compilation problem: does this definitely indicate a software problem? It looks like we have two different causes.
Comment 69 SF 2017-07-14 16:39:34 UTC
I contacted AMD about this issue, and after some messages I got the answer that raising the CPU voltage makes the RAM more stable. If I understand what AMD wrote correctly, this should only be necessary if you have incompatible DRAM.
Comment 70 Don Lewis freebsd_committer freebsd_triage 2017-07-20 00:00:56 UTC
I now think that AGESA 1006 actually didn't fix anything for me.  I must have gotten lucky with that first poudriere run after the BIOS upgrade.  The next time I ran poudriere, I got a silent reboot after ~3 hours.  The times to failure just looked too consistent for me, so I looked at the poudriere build logs to see what was being built at the time of the crash.  One of them was openjdk7.  One of the ports that got built when I restarted poudriere to build the remaining ports that failed after the BIOS upgrade was openoffice, which uses java, so things started making sense.

If I try building openjdk7, I can pretty much consistently trigger a system reboot, even with SMT off, only two cores enabled in the BIOS, the CPU clock speed lowered to 3 GHz, and the RAM clock cranked down from 2400 MHz to 1866 MHz.

When I marked openjdk7 BROKEN so that poudriere doesn't build it and skips the ports that depend on it, the system stayed up and poudriere ran for almost 9 hours, though two ports failed with the jemalloc assertion failure that I previously mentioned.

I also now think that the Dragonfly patch isn't needed on FreeBSD and potentially could be harmful.  It is meant to work around what looks like a Ryzen SMT bug.  The problem appears to be triggered by executing code close to the top of user address space.  On Dragonfly, the signal trampoline code is located just above the stack and very close to the top of user address space.  By adding space to the end of sigtramp.S, the trampoline code is moved to a lower starting address.  On FreeBSD, the signal trampoline code was moved to a separate memory page so that the stack could be marked non-executable.  This page is located at the very top of user address space.  I haven't looked at what all is in this page, but if the contents are loaded starting at the bottom of the page, then the start of the signal trampoline is likely to be at a lower address than on Dragonfly.  If other code is loaded in this page after the signal trampoline, then adding space at the end could move that code closer to the danger zone.  In any case, I had been doing much of my testing with SMT disabled, so I removed this patch from my kernel.

After backing out the Dragonfly patch and also marking bootstrap-openjdk as BROKEN to eliminate any vestige of java, setting the RAM and CPU clocks back to auto, I ran poudriere again and the run was mostly successful, though I did see a lang/go build failure due to a runaway build problem.

I then enabled SMT and core performance boost and ran poudriere again.  I observed build failures of lang/go, gdb, and cairo.  I didn't see any obvious problems with the latter two, it looked like something in each just returned the wrong exit status.  Restarting poudriere successfully built the latter two, but go failed again.  The go failures appeared to be caused by some sort of corruption of its malloc state.  Note: go is multi-threaded.

Just for grins, I decided to try building ports in an i386 jail.  I got no unexpected failures.  The results were the same when I re-enabled the java ports.  It successfully built 1594 ports in 8 hours 33 minutes.  I was even able to build lang/ghc on i386.  That one always had segfaults in the bootstrap compiler for me on amd64.  I have no idea if it uses threads, though.

At least on my hardware there are one or more problems with amd64 code.  It might just be multi-threaded processes.  The java problem could also be caused by the hotspot compiler, which may look like self-modifying code.  In any case, it can cause system hangs or reboots and may also corrupt the state of other processes.  I finally received the hardware to set up a serial console yesterday, but I haven't had time to install it yet.  The reboots that I've seen don't seem to leave any trace in the logs, don't seem to trigger ddb, and don't leave crash dumps.
Comment 71 Don Lewis freebsd_committer freebsd_triage 2017-07-20 01:04:20 UTC
(In reply to Nils Beyer from comment #67)

The Linux version of the test pins the entire process to a single randomly-chosen CPU.  The following diff should make it work the same on FreeBSD:

--- ryzen_segv_test.c.orig	2017-07-19 17:26:47.686991000 -0700
+++ ryzen_segv_test.c	2017-07-19 17:35:56.406073000 -0700
@@ -69,6 +69,7 @@
 #include <sys/mman.h>
 #include <sched.h>
 #include <sys/types.h>
+#include <sys/cpuset.h>
 #include <unistd.h>
 
 #endif
@@ -332,7 +333,7 @@
 	pthread_t t1, t2, t3;
 #ifdef _MSC_VER
 #else
-	cpu_set_t cpuset;
+	cpuset_t cpuset;
 	int cpu;
 #endif
 	pid_t pid = getpid();
@@ -365,7 +366,7 @@
 	cpu = random() % n_cpus;
 	CPU_ZERO(&cpuset);
 	CPU_SET(cpu, &cpuset);
-	sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset);
+	cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, pid, sizeof(cpuset_t), &cpuset);
 	fprintf(stderr, "PID:%d CPU:%d\n", (int)pid, cpu);
 #endif
 	

With that change, I don't see any problems on my machine.  That seems to be an unfairly easy test, though.
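
For anyone who wants to reproduce the pinning outside the test program, here is a minimal sketch of the same idea. It only assumes the calls already used in the diff (cpuset_setaffinity() on FreeBSD, sched_setaffinity() on Linux); the helper name pin_process_to_cpu and the pin_set_t typedef are made up for illustration.

```c
#define _GNU_SOURCE		/* Linux: exposes CPU_SET() and sched_setaffinity() */
#include <sys/types.h>
#include <unistd.h>

#ifdef __FreeBSD__
#include <sys/param.h>
#include <sys/cpuset.h>
typedef cpuset_t pin_set_t;	/* hypothetical alias, FreeBSD flavour */
#else
#include <sched.h>
typedef cpu_set_t pin_set_t;	/* hypothetical alias, Linux flavour */
#endif

/*
 * Restrict the calling process to one CPU.
 * Returns 0 on success and -1 on error (errno is set by the system call).
 */
int
pin_process_to_cpu(int cpu)
{
	pin_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
#ifdef __FreeBSD__
	/* Same call as in the diff above, applied to our own pid. */
	return (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID,
	    (id_t)getpid(), sizeof(set), &set));
#else
	/*
	 * On Linux this sets the mask of the thread whose ID equals the
	 * pid (the main thread); threads created afterwards inherit it.
	 */
	return (sched_setaffinity(getpid(), sizeof(set), &set));
#endif
}
```

Calling pin_process_to_cpu(random() % n_cpus) early in main() would mirror what the test program does. A CPU number outside the set the process is allowed to run on is expected to make the call fail, so the return value is worth checking.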

If I nuke the same code that you did, I still don't see any failures, though the runtime of each iteration seems to vary a lot.  With loops set to 10000000, the process may run anywhere from a few seconds to close to a minute:

12.804u 0.007s 0:04.27 299.7%	10+167k 2+0io 0pf+0w
23.431u 0.000s 0:07.81 300.0%	10+167k 2+0io 0pf+0w
9.789u 0.000s 0:03.26 300.0%	10+167k 2+0io 0pf+0w
9.656u 0.007s 0:03.22 299.6%	10+167k 2+0io 0pf+0w
9.596u 0.000s 0:03.20 299.6%	10+167k 2+0io 0pf+0w
12.041u 0.000s 0:04.01 300.2%	10+167k 2+0io 0pf+0w
22.294u 0.007s 0:07.43 300.0%	10+167k 2+0io 0pf+0w
13.290u 0.000s 0:04.43 300.0%	10+167k 2+0io 0pf+0w
10.939u 0.007s 0:03.65 299.4%	10+167k 2+0io 0pf+0w
12.953u 0.000s 0:04.31 300.4%	10+167k 2+0io 0pf+0w
11.994u 0.000s 0:04.00 299.7%	10+167k 2+0io 0pf+0w
13.084u 0.000s 0:04.36 300.0%	10+167k 2+0io 0pf+0w
125.599u 0.007s 0:41.87 299.9%	10+167k 2+0io 0pf+0w
Comment 72 SF 2017-07-20 07:38:19 UTC
Dear ,

Your service request : SR #{ticketno:[]} has been reviewed and updated.

Response and Service Request History:

Thank you for the email and feedback. 

Higher memory frequencies on Ryzen can require a higher CPU voltage to ensure full stability, so this would be expected once you start overclocking the memory above 2667Mhz default values. 

If your system requires these settings to be stable when using speeds of 2667Mhz or lower, I recommend sharing your experiences with the Motherboard manufacturer, as you could have a faulty Motherboard or there could be a bug in the Motherboard bios that would need to be addressed.

We have not received any similar reports of the issues you have described, however we will stay vigilant to see if there are more reports of similar issues from other customers. 


With regards to memory, memory frequency up to 4000Mhz has been unlocked so that Motherboard manufacturers may test and validate higher speed memory configurations on their Motherboards. Furthermore, it allows customers who like to overclock and tune their systems the possibility to extract extra performance by running their memory at a higher frequency than was previously possible. 

Unfortunately overclocking memory can be extremely complicated and many different factors are involved and so there is no guarantee that you will be able to run your memory frequency any higher than the default (2667Mhz or lower) supported values.

Due to the above, Motherboard vendors test various memory types and configurations and certify them on a Qualified Vendor List (QVL) for each Motherboard they manufacture. I checked your Motherboard and found that Gigabyte have not validated any memory to run at speeds above 3200Mhz using two or more DIMMS.

As per the QVL, the memory you are using has not been certified to run at 3600Mhz/3466Mhz speed with a Ryzen processor. If you are unhappy with this, I suggest contacting the Motherboard manufacturer. AMD does not guarantee memory will run at any higher than 2667Mhz.

You can download the QVL for your Motherboard at the following link > https://www.asus.com/uk/Motherboards/PRIME-B350-PLUS/HelpDesk_QVL/

Please refer to our gaming blog where this limitation regarding memory is called out specifically. 

For speed grades greater than DDR4-2667, please refer to a motherboard vendor’s memory QVL list. Each motherboard vendor tests specific speeds, modules, and capacities for their motherboards, and can help you find a memory pairing that works well. It is important you stick to this list for the best and most reliable results.
DDR4 Speed (MT/s)	Memory Ranks	DIMM Quantities
2667	Single	2
2400	Dual	2
2133	Single	4
1866	Dual	4

https://community.amd.com/community/gaming/blog/2017/03/14/tips-for-building-a-better-amd-ryzen-system


Thank you for contacting AMD. 

Please be advised that this service request will be permanently closed if you do not reply within 10 days. If more time is needed to respond to my e-mail above, please let me know and I will ensure that this service request remains open for you.

In order to update this service request, please respond, leaving the service request reference intact.

Best regards,

AMD Global Customer Care
Comment 73 SF 2017-07-20 07:39:17 UTC
Does any of you have RAM that is on the compatibility lists?
Comment 74 Nils Beyer 2017-07-20 08:21:35 UTC
(In reply to Don Lewis from comment #70)
Don, many thanks for your extensive report.

Regarding building "java/openjdk7": that never did trigger a hard-lock or a reboot on my system here, unfortunately. The Java compilations of different ports don't care about processor limitations; they always use all available cores - at least here during my poudriere runs. So this could help stress your system until it crashes.

Regarding "DragonflyBSD patch": okay, makes sense that this patch is not needed/incompatible on FreeBSD. Thanks for the clarification.

Regarding building "lang/ghc": yes, that build always generates a "bus error" on my Ryzen system. On my Intel system, that build is always successful. Maybe expected behaviour due to wrongly optimized bootstrap.

Regarding i386 builds: interesting that there seem to be no problems in 32bit environments. But I don't want to go back to 32bit. ;-)

Regarding serial console: thanks for trying that out - I fear that you won't get any debug messages during the crashes/lock-ups though. But let's hope for the best.
Comment 75 Nils Beyer 2017-07-20 08:27:18 UTC
(In reply to Don Lewis from comment #71)
Using the pin-to-core patch, the stress test doesn't use all cores any more, so that the overall load is around 50%-60% only. With that I couldn't get any segfaults fast enough.

FWIW, here's a GDB output of one of the core files:
---------------------------------------------------------------------------
#gdb7121 ./ryzen_segv_test ryzen_segv_test.core
GNU gdb (GDB) 7.12.1 [GDB v7.12.1 for FreeBSD]
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd11.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./ryzen_segv_test...(no debugging symbols found)...done.
[New LWP 100861]
[New LWP 100711]
[New LWP 100935]
[New LWP 101000]
Core was generated by `./ryzen_segv_test 2500000'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000400c06 in thread1 ()
[Current thread is 1 (LWP 100861)]
(gdb) bt
#0  0x0000000000400c06 in thread1 ()
#1  0x000000080082bbc5 in thread_start (curthread=0x801216500) at /usr/src/lib/libthr/thread/thr_create.c:289
#2  0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffdfffe000
---------------------------------------------------------------------------

So the backtrace is corrupted and not usable, correct?
Comment 76 Nils Beyer 2017-07-20 08:33:58 UTC
(In reply to SF from comment #72)
SF, thanks for opening a support ticket. What I don't understand is that they still assume that you are overclocking. We're talking about stock settings, aren't we? I'm using two dual-rank modules at 2133; that should be well under their suggested limit of 2400.

At least, AMD now has another support request; I opened a ticket with references to this bug report (and all of the segfaults forum threads) yesterday...
Comment 77 Nils Beyer 2017-07-20 08:41:48 UTC
(In reply to SF from comment #73)
I do; my memory modules are:

    2x Crucial CT16G4DFD8213.C16FAD

The QVL list for the ASRock AB350 Pro4 says:

    http://www.asrock.com/mb/AMD/AB350%20Pro4/#Memory

    Type - Speed - Size - Vendor - Module - Chip - Part No - SS/DS - Single Channel - Dual Channel - OC
    ---------------------------------------------------------------------------------------------------
    DDR4 - 2133 - 16GB - Crucial - CT16G4DFD8213.16FB1 - N/A - N/A - SS - N/A - 4pcs - N/A

No idea if the six/five letters after the "CT16G4DFD8213" do mean anything; I couldn't find any differences using google. And what "SS/DS" means, no idea...
Comment 78 Nils Beyer 2017-07-20 08:44:46 UTC
Well, people - especially in the AMD forums - are very disappointed because of the lack of any reaction from AMD itself in these threads; and I can understand them...
Comment 79 SF 2017-07-20 09:23:46 UTC
Such letters can make a difference because the applied default settings can vary with them.

My RAM, even if not listed, is stable and only suffers from the power problem.
Comment 80 Nils Beyer 2017-07-20 09:32:58 UTC
(In reply to SF from comment #79)

Well, you cannot explicitly order the "16FB1"- or "C16FAD"-variant. All these variants are listed as "CT16G4DFD8213" in all shops. And even Crucial itself doesn't seem to distinguish between these variants:

    https://www.crucial.com/wcsstore/CrucialSAS/pdf/product-flyer/ram/crucial-ddr4-desktop-en.pdf

Maybe these letters only mean a different fabrication date of modules...
Comment 81 SF 2017-07-20 09:58:28 UTC
If so, then there is no difference and it pretty much looks like a shorthand for a date.
Comment 82 Nils Beyer 2017-07-20 10:46:45 UTC
Created attachment 184539 [details]
tarred archive of the segfault stress test
Comment 83 Andriy Gapon freebsd_committer freebsd_triage 2017-07-20 10:48:28 UTC
(In reply to Nils Beyer from comment #45)

Interesting... That MCE report is about a correctable problem with the CPU's instruction cache.  Perhaps, the high load triggers a CPU erratum related to that cache.  And maybe the problems of this kind are not always correctable or even detectable. With junk in the instruction cache any kind of a system or application misbehaviour is possible.
Comment 84 Nils Beyer 2017-07-20 10:49:06 UTC
(In reply to Don Lewis from comment #71)

Don, I've attached my "ryzen_segv_test" binary (compiled for 11.1-RC3) incl. the shell scripts as a tar.xz archive. Could you please try "./run.sh 16 2500000" for me?

TIA
Comment 85 Nils Beyer 2017-07-20 11:10:52 UTC
(In reply to Andriy Gapon from comment #83)

makes sense to me; if there are undetected errors in the instruction cache, like corrupted memory addresses, that would explain these corrupted stack traces, too.

I got these messages on another Ryzen system, too; different core, but still "MCA: CPU 10 COR ICACHE L1 IRD error". No system lock-ups or reboots there yet, but this system is not poudriering due to lack of RAM (only 16GB). The only thing I did there was buildworlds in a RAM disk, and those failed several times because of the "unable to rename temporary" TMPFS phenomenon. The CPU is the same model (1700) but bought three months after my CPU, so I don't think it's the same batch.

For me, the hints lead more and more to a design flaw within the CPU; and hopefully, that could be fixed with a microcode update...
Comment 86 Nils Beyer 2017-07-20 11:31:36 UTC
Created attachment 184542 [details]
quick&dirty LUA script to read CPU temperatures; needs "superiotool"
Comment 87 Don Lewis freebsd_committer freebsd_triage 2017-07-20 18:07:41 UTC
This variant of the Ryzen segv test pins each thread to a randomly chosen CPU.  Top in thread display mode shows that the individual threads are pinned to the reported CPUs.  I ran it overnight with a loop count parameter of 10000000 and observed no errors.

--- ryzen_segv_test.c.orig	2017-07-19 17:26:47.686991000 -0700
+++ ryzen_segv_test2.c	2017-07-19 22:05:45.162536000 -0700
@@ -69,6 +69,8 @@
 #include <sys/mman.h>
 #include <sched.h>
 #include <sys/types.h>
+#include <sys/cpuset.h>
+#include <pthread_np.h>
 #include <unistd.h>
 
 #endif
@@ -332,8 +334,8 @@
 	pthread_t t1, t2, t3;
 #ifdef _MSC_VER
 #else
-	cpu_set_t cpuset;
-	int cpu;
+	cpuset_t cpuset;
+	int cpu1, cpu2, cpu3;
 #endif
 	pid_t pid = getpid();
 	
@@ -362,11 +364,20 @@
 	
 #ifdef _MSC_VER
 #else
-	cpu = random() % n_cpus;
+	cpu1 = random() % n_cpus;
+	cpu2 = random() % n_cpus;
+	cpu3 = random() % n_cpus;
 	CPU_ZERO(&cpuset);
-	CPU_SET(cpu, &cpuset);
-	sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset);
-	fprintf(stderr, "PID:%d CPU:%d\n", (int)pid, cpu);
+	CPU_SET(cpu1, &cpuset);
+	pthread_setaffinity_np(t1,  sizeof(cpuset_t), &cpuset);
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu2, &cpuset);
+	pthread_setaffinity_np(t2,  sizeof(cpuset_t), &cpuset);
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu3, &cpuset);
+	pthread_setaffinity_np(t3,  sizeof(cpuset_t), &cpuset);
+	/* cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, pid, sizeof(cpuset_t), &cpuset); */
+	fprintf(stderr, "PID:%d CPU1:%d, CPU2:%d CPU3:%d\n", (int)pid, cpu1, cpu2, cpu3);
 #endif
 	
 	pthread_join(t1, NULL);
Comment 88 Nils Beyer 2017-07-20 18:43:33 UTC
(In reply to Don Lewis from comment #87)

thanks for the modified version; I've tried it. It loads the CPU more, yes, but there's still at least one core (sometimes two or more) idling around. I don't get any segmentation faults within an acceptable time frame (<5 minutes).

Please try the version that I've attached

    https://bugs.freebsd.org/bugzilla/attachment.cgi?id=184539

I know you've said that you've already tried it - but perhaps there's something different on my system compiler. Here it segfaults after a maximum of one minute.

I believe that the key to these segmentation faults is the threads switching between cores, which is prevented by pinning these threads to specific cores. How far any instruction cache is involved with that, I don't know...

TIA
Comment 89 Don Lewis freebsd_committer freebsd_triage 2017-07-23 20:52:54 UTC
Created attachment 184641 [details]
patch to move amd64 shared page to a lower address to avoid Ryzen problem with executing code near user address upper limit

I've been doing a number of experiments with openjdk7 builds to try to better characterize the Ryzen problem.

First I did a number of openjdk7 builds using cpuset to pin the build to individual cores.  Using cpuset -l 0 to pin the build to the first thread on core 0 would consistently cause a silent reboot on the first or second try.  Pinning  the build to any of the other cores allowed me to successfully build openjdk7.  I ran four builds on each of the other cores to make sure that I wasn't just getting a successful build by chance.  Surprisingly, pinning the build to the second thread on core 0 was also successful.  In any case, the results were consistent with my earlier tests where I disabled SMT and also all but two cores in the BIOS, since those tests always used the first thread on core 0.

I tried building openjdk7 on all cores except the first thread of core 0 by using cpuset -l 1-15 and was also successful.

Based on that positive result, I tried building my default set of ~1600 ports with cpuset -l 1-15.  A little over two hours into the build, the llvm40 build failed with the:
  _arena.c:821: Failed assertion: "nstime_compare(&decay->epoch, &time) <= 0")
causing the ports that depend on it to be skipped, but everything else built successfully.  When I restarted poudriere, the llvm40 build succeeded, but the system hung after about an hour while running java as part of the openjdk7 build.

Next I tried building with cpuset -l 2-15.  The only problem that I ran into is that the gcc build failed with SIGBUS, causing its dependencies to be skipped.  When I restarted poudriere, gcc5 and the remaining ports built successfully.

I wanted to try to eliminate the possibility of a subtle defect in core 0 as a potential cause of the problem, so I tried adding
 hint.lapic.0.disabled=1
 hint.lapic.1.disabled=1
to /boot/loader.conf, but FreeBSD does not allow the BSP to be disabled B-(

The other thing that is unique about core 0 on my machine is that it looks like all of the external interrupts (but not interprocessor interrupts) go there.  The biggest source of those seemed to be hpet, but I couldn't figure out how to disable that (other than maybe disabling ACPI totally).  When I tried hint.hpet.0.clock=0, all of the CPUs got assigned interrupts from another timer.

The next thing I tried was inspired by the Dragonfly patch.  At least some thread implementations use signals to communicate between threads.  I'm not familiar with OpenJDK, but it is possible that it is such an implementation, so it might be a heavy signal user and spend a lot of cycles in the signal trampoline code.  Our signal trampoline code is in a different location than Dragonfly uses, but it is still close to (in the top page of) the top of user memory.  Even though I got the impression that the Dragonfly patch addresses an issue with SMT, it does involve an interaction between interrupts and execution of code near the top of user memory.

As an experiment, I patched the kernel to move the location of the shared page lower by PAGE_SIZE.  I'm not sure if it is necessary, but the page at the old location has the same rwx permissions and is zero filled.  I don't know if the bug is triggered by executing code close to the upper address boundary or close to a permission boundary.  The preliminary results so far are very promising.  With the patch applied, I am able to successfully build openjdk7 either unpinned or pinned to the first thread of core 0.

I just kicked off an unpinned ~1600 port poudriere run.  I should have results of that late today.

The patch is attached.
Comment 90 Don Lewis freebsd_committer freebsd_triage 2017-07-24 06:35:44 UTC
With my patch to relocate the shared page, I was finally able to get a successful poudriere run.  It built 1596 ports in about 8 1/2 hours with no errors.

I'm still not able to build lang/ghc, which I think is due to another problem of some sort.  The SIGBUS error generated by the bootstrap compiler seems to be totally repeatable.

I think my patch did introduce a new problem.  Whenever a process core dumps, this new message gets logged:
  Failed to fully fault in a core file segment at VA 0x7fffffffe000 with size 0x2000 to be written at offset 0x7fef000 for process ghc

I'll try reverting the shared page size part of the change.
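
The VA in the core dump message above is consistent with the shifted layout. A small userland sketch of the arithmetic follows; it assumes the amd64 values VM_MAXUSER_ADDRESS = 0x800000000000 and PAGE_SIZE = 4096, which are restated here as assumptions rather than copied from the patch, and all sketch_ names are made up.

```c
#include <stdint.h>

/* Assumed amd64 values, restated for illustration only. */
#define SKETCH_VM_MAXUSER_ADDRESS	UINT64_C(0x0000800000000000)
#define SKETCH_PAGE_SIZE		UINT64_C(4096)

/*
 * Base address of a shared page that starts `pages_below_top` pages
 * below the top of user address space (1 = stock, 2 = shifted).
 */
uint64_t
sketch_sharedpage_base(unsigned pages_below_top)
{
	return (SKETCH_VM_MAXUSER_ADDRESS -
	    pages_below_top * SKETCH_PAGE_SIZE);
}

/* First address past a segment that starts at `va` and spans `size` bytes. */
uint64_t
sketch_segment_end(uint64_t va, uint64_t size)
{
	return (va + size);
}
```

Under these assumptions the stock base is 0x7ffffffff000 and the shifted base is 0x7fffffffe000. A segment of size 0x2000 starting at the shifted base ends exactly at 0x800000000000, which matches the VA and size reported in the message.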
Comment 91 Nils Beyer 2017-07-24 07:14:19 UTC
(In reply to Don Lewis from comment #89)

really interesting research you've done there; big thanks for that. If shifting the shared page really helps getting the system stable, is it still a hardware problem/bug then or more a software problem? If hardware, is that fixable by a microcode update or is the CPU design per se "incompatible"? What do you think?

Anyways, I've applied your patch and started a poudriere build. Based on your experiences, I'm somewhat optimistic now. I'll let you know how it's going here.

The "ryzen_segv_test" still fails despite the patch; the only way to circumvent that is to disable SMT. But that's not a final option for me...
Comment 92 Konstantin Belousov freebsd_committer freebsd_triage 2017-07-24 08:10:58 UTC
(In reply to Don Lewis from comment #90)
Yes, the coredumping message is because the object backing the shared page entry is only initialized with a single page, so an attempt to read from the second page cannot be satisfied without the backing physical memory.

From what I see in the AMD support forums/reddit threads, the issue is not diagnosed yet and AMD is silent about it.  The strangest thing I found was a claim that sometimes the CPU executes instructions from %rip+0x40 instead of %rip.  That would explain Dillon's fix but would probably have no effect on the FreeBSD trampoline layout, unless some more weirdness is in place.

If the problem is indeed hardware (I hope so) and AMD is able to identify and fix it, I very much dislike the global change to the AMD64 native VA layout.  My concerns are due to the USRSTACK value leaking to tools and becoming part of the ABI.  For instance, I added kern.proc.<pid>.sigtramp for the debuggers and unwinders like libunwind to avoid using a pre-defined value for the trampoline base to detect signal frames, but some tools are not converted, and old binaries cannot be fixed.  There is a similar concern for old libc's setproctitle(3).  Etc.

I suggest trying a different approach for implementing your workaround: if a matching CPU is detected, decrement sv_usrstack and sv_shared_page_base by PAGE_SIZE.  I expect that the image activator is parametrized by struct sysentvec enough to make this work; if not, I will fix it.  For Linux 64 bit emul, a similar adjustment for the Linux ABI sysentvec should be done at module init.
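
A rough userland sketch of that suggestion, not kernel code: the struct below is a hypothetical stand-in that carries only the two fields named above, the page size is an assumed 4096, and matching on "AMD, family 17h" is an assumption about what a matching CPU would mean.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	UINT64_C(4096)	/* assumed amd64 page size */

/* Hypothetical stand-in for the two struct sysentvec fields named above. */
struct sketch_sysentvec {
	uint64_t sv_usrstack;
	uint64_t sv_shared_page_base;
};

/*
 * Hypothetical CPU match.  Treating every AMD family 17h part as
 * affected is an assumption made for this sketch.
 */
bool
sketch_cpu_is_affected(bool vendor_is_amd, unsigned family)
{
	return (vendor_is_amd && family == 0x17);
}

/* Move the stack top and the shared page down by one page if affected. */
void
sketch_apply_workaround(struct sketch_sysentvec *sv, bool affected)
{
	if (!affected)
		return;
	sv->sv_usrstack -= SKETCH_PAGE_SIZE;
	sv->sv_shared_page_base -= SKETCH_PAGE_SIZE;
}
```

The point of doing it this way is that the compile-time layout stays untouched and only machines that match get the lower addresses; the `affected` flag could just as well come from a tunable.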

It is a shame that AMD is silent and does not provide errata/notifications of problems for their flagship CPUs.
Comment 93 Andriy Gapon freebsd_committer freebsd_triage 2017-07-24 08:48:23 UTC
(In reply to Don Lewis from comment #89)
Very interesting results! Thank you.

BTW, it should be possible to reassign interrupts with cpuset -x, but I'm not sure if that would work for the HPET interrupts as there is a little bit of magic about them.
Comment 94 Don Lewis freebsd_committer freebsd_triage 2017-07-24 16:08:51 UTC
My machine is still stable if I only shift the shared page location and leave the size unchanged.

Instead of decoding the CPU type to decide on whether to do this, I'm thinking of making it a tunable so that those with non-Ryzen hardware can test the change for side effects and Ryzen users can test whether future microcode updates fix the problem.
Comment 95 Nils Beyer 2017-07-24 16:15:34 UTC
(In reply to Don Lewis from comment #94)

I have your complete patch running on two Ryzen machines. One is currently poudriering packages - still running. That is the machine that always freezes or reboots.

The other machine is performing a 20-threads endless buildkernel/buildworld into a TMPFS. Before the patch I got failures due to "unable to rename temporary..." errors. Until now, they have not re-appeared yet (21 passes so far).

So, the only patch I'll have to use is the change of the "SHAREDPAGE"-define and not the other ones, correct?
Comment 96 Don Lewis freebsd_committer freebsd_triage 2017-07-24 16:19:51 UTC
(In reply to Nils Beyer from comment #91)
Since the IRET instruction seems to be a factor in this problem, and it is supposedly microcoded, I'm hoping that a microcode fix is possible for it.

I would not expect this to fix problems found by ryzen_segv_test since that code does not use signals, so those processes should not be executing the signal trampoline code that my patch relocates.

It's curious that disabling SMT fixes that problem for you.  Can you try running ryzen_segv_test under "cpuset -l 2-15" to see if avoiding running the test on the core that handles interrupts makes a difference?
Comment 97 Don Lewis freebsd_committer freebsd_triage 2017-07-24 16:23:08 UTC
(In reply to Nils Beyer from comment #95)
I've seen the "unable to rename temporary..." errors, but only on an older version of 12.0-CURRENT.  I've never seen it with r320570, which I'm currently running.

Yes, only the #define needs to be changed.  The other two changes have undesirable side effects.
Comment 98 Nils Beyer 2017-07-24 16:39:10 UTC
(In reply to Don Lewis from comment #96)

unfortunately, "cpuset" doesn't help:
-----------------------------------------------------------------------
#cpuset -l 2-15 ./run.sh 16 2500000
Segmentation fault
-----------------------------------------------------------------------
Comment 99 Nils Beyer 2017-07-24 16:40:05 UTC
(In reply to Don Lewis from comment #97)

okay, rebuilt the kernels of my two machines with the SHAREDPAGE modification only. And restarted the stress tests...
Comment 100 Nils Beyer 2017-07-24 16:54:58 UTC
(In reply to Don Lewis from comment #97)

and I got the "unable to rename temporary" error right in the first pass on my buildkernel/buildworld system.

So, it seems that only this single change:
---------------------------------------------------------------------------
+#define        SHAREDPAGE              (VM_MAXUSER_ADDRESS - 2*PAGE_SIZE)
---------------------------------------------------------------------------

is not enough. I'll try with the whole patch again - and overnight...
Comment 101 Don Lewis freebsd_committer freebsd_triage 2017-07-24 16:56:57 UTC
(In reply to Nils Beyer from comment #75)
Nothing wrong with that backtrace.  Each thread gets its own stack, which is created by thread_start().
Comment 102 SF 2017-07-24 18:08:04 UTC
Are you sure you aren't just masking the symptoms with this? I still think it is caused by RAM.
Comment 103 Don Lewis freebsd_committer freebsd_triage 2017-07-24 18:20:16 UTC
(In reply to SF from comment #102)
I'm pretty sure that my machine doesn't have RAM problems.  I can build ports in an i386 jail with no problems.  Without the patch, building ports in an amd64 jail is flakey.  Shifting the location of the shared page fixes the problem and the machine can do an 8 1/2 hour port build run, hitting a load average of 60, with no errors.  Without the change, building openjdk7 is very likely to crash the machine.  With the change I can run openjdk7 builds all night without any errors.

I've got ECC RAM and I haven't seen any sign of even a single bit memory error in the logs.
Comment 104 SF 2017-07-24 18:40:29 UTC
Yeah, but environmental circumstances like temperatures or minor processes running do affect this.

No one has had success solving this by deactivating features like SMT and so on, and such errors happen randomly, not always at the same point after the same time.

Your jail might just keep the load down enough to prevent the machine from exceeding its limits. You need to try this in more reproducible environments.
Comment 105 SF 2017-07-24 18:43:26 UTC
I tested various things myself, and none of the benchmarks or heating causes the machine to have any errors, but some processes do cause these errors, and they are only processes which also put a high load on the machine. For me, the only solution that kept the machine from crashing for a far longer time was setting some power settings. It need not be caused by the RAM itself; the RAM can trigger misbehavior within the CPU.
Comment 106 Ivan Rozhuk 2017-07-24 22:03:38 UTC
Increase Vsoc to 1.1 V; this fixed some strange issues with the video card for me.

I have no build errors: world+kernel, llvm, firefox, libre office all build OK.
But I have strange reboots after 3-7 days of uptime.
Before I increased Vsoc I also had strange problems with the video card: tearing on video playback 1-2 days before a reboot. And one time, after the tearing started, I rebooted the system and the AMD video driver failed to init in its self test (some ring buffer test failed).
Comment 107 Don Lewis freebsd_committer freebsd_triage 2017-07-24 22:18:01 UTC
(In reply to SF from comment #104)
The temperature in my un-airconditioned office varies quite a bit and I've never been able to correlate the observed instability with temperature.  I first assembled the system before the heat of summer and it crashed.  These days it will crash during the cool night hours with the office window open and the cool outside air blowing on it.  It will crash in the late afternoon when the sun has been coming in through the west-facing glass door and making the room uncomfortably hot.  It will crash under full load or with only two cores enabled in the BIOS and SMT off.  It will crash with the system fans under temperature control or turned on full blast.

With my patch, I had no problems doing repeated openjdk builds during the cool overnight hours and no problems with a lengthy poudriere run on an especially warm day in my office.
Comment 108 Don Lewis freebsd_committer freebsd_triage 2017-07-24 22:26:03 UTC
(In reply to Nils Beyer from comment #91)
I'm pretty sure that ryzen_segv_test is actually broken.  The first iteration of the loop in the t2 threadx() is unlocked and there is no guarantee that it will have initialized things before thread1() tries to use them.

Try this patch:

--- ryzen_segv_test.c.orig	2017-07-24 14:26:23.851846000 -0700
+++ ryzen_segv_test.c	2017-07-24 15:02:33.998102000 -0700
@@ -291,29 +291,32 @@
 	atomic_store(&flg, 0);
 }
 
+void threadx_core()
+{
+	uint8_t offset;
+	uint32_t randval;
+
+	offset = random() % 256;
+	randval = random();
+	memset(func_set, 0, sizeof(func_set_t));
+	memcpy(&func_set->func[offset], func_base, FUNC_BYTES);
+	func_set->offset = offset;
+	func_set->ret = randval;
+}
+
 void threadx(void *p)
 {
 	uint8_t offset;
 	uint32_t randval;
 	int init = 0;
-	if(p != NULL) {
-		init = 1;
-	}
 	
 	//usleep(1000);
 
 	while(atomic_load(&flg)) {
 		offset = random() % 256;
 		randval = random();
-		if(!init) {
-			lock_enter();
-		} else {
-			if(func_set == MAP_FAILED) {
-				fprintf(stderr, "mmap returns MAP_FAILED!\n");
-				return;
-			}
-			init = 0;
-		}
+		lock_enter();
+		// threadx_core();
 		memset(func_set, 0, sizeof(func_set_t));
 		memcpy(&func_set->func[offset], func_base, FUNC_BYTES);
 		func_set->offset = offset;
@@ -330,8 +333,7 @@
 {
 	int64_t loops;
 	pthread_t t1, t2, t3;
-#ifdef _MSC_VER
-#else
+#if !defined(_MSC_VER) && !defined(__FreeBSD__)
 	cpu_set_t cpuset;
 	int cpu;
 #endif
@@ -349,19 +351,23 @@
 	n_cpus = sysconf(_SC_NPROCESSORS_ONLN);
 	func_set = mmap (NULL, sizeof(func_set_t), PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 #endif
+	if(func_set == MAP_FAILED) {
+		fprintf(stderr, "mmap returns MAP_FAILED!\n");
+		exit (1);
+	}
 
 	atomic_store(&flg, 1);
 	atomic_store(&locked, 1);
 	
 	srandom(time(NULL) + pid);
 	// You should confirm assembly of generated code, just in case the compiler reorders mfence instruction
+	threadx_core();
 	mfence(); // Assure that flags are stored properly
 	pthread_create(&t1, NULL, (void*)thread1, &loops);
-	pthread_create(&t2, NULL, (void*)threadx, (void*)1);
+	pthread_create(&t2, NULL, (void*)threadx, NULL);
 	pthread_create(&t3, NULL, (void*)threadx, NULL);
 	
-#ifdef _MSC_VER
-#else
+#if !defined(_MSC_VER) && !defined(__FreeBSD__)
 	cpu = random() % n_cpus;
 	CPU_ZERO(&cpuset);
 	CPU_SET(cpu, &cpuset);
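
The ordering fix in the patch above boils down to filling in the shared state before any thread that reads it is created, since pthread_create() orders the creator's earlier writes before the new thread's reads. A stripped-down sketch of just that idea follows; the sketch_ names are made up and the struct only stands in for func_set, so this is not the test program itself.

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical shared state standing in for func_set in the test. */
struct sketch_shared {
	uint8_t  offset;
	uint32_t ret;
};

static struct sketch_shared sketch_state;

/* Fill the shared state; similar in spirit to threadx_core() in the patch. */
void
sketch_init_state(uint8_t offset, uint32_t ret)
{
	memset(&sketch_state, 0, sizeof(sketch_state));
	sketch_state.offset = offset;
	sketch_state.ret = ret;
}

/* Reader thread: copies what it sees into the caller's buffer. */
static void *
sketch_reader(void *arg)
{
	*(struct sketch_shared *)arg = sketch_state;
	return (NULL);
}

/*
 * Initialise first, then start the reader.  Returns 0 on success and
 * -1 if the thread could not be created or joined.
 */
int
sketch_run(uint8_t offset, uint32_t ret, struct sketch_shared *seen)
{
	pthread_t t;

	sketch_init_state(offset, ret);	/* done before pthread_create() */
	if (pthread_create(&t, NULL, sketch_reader, seen) != 0)
		return (-1);
	return (pthread_join(t, NULL) == 0 ? 0 : -1);
}
```

With the initialisation moved in front of pthread_create(), the reader cannot observe a half-filled structure, which is the situation the unlocked first iteration allowed.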
Comment 109 Nils Beyer 2017-07-24 22:29:33 UTC
A little update:

1) after 13 passes my buildkernel/buildworld system threw two of the dreaded errors one right after another - despite being patched with the maxi-version of Don's patch:
---------------------------------------------------------------------------------
1500932508.log:error: unable to rename temporary 'vnode.o-39c444c6' to output file 'vnode.o': 'No such file or directory'
1500933735.log:error: unable to rename temporary 'editline.o-ada8c3af' to output file 'editline.o': 'No such file or directory'
---------------------------------------------------------------------------------

So this behaviour is not fixed. *sigh* Trying a current CURRENT is not an option for me at the moment, sorry. Just out of curiosity: have there been any fixes regarding TMPFS handling in CURRENT lately?

2) my poudriere system (only the single-line patch applied) still builds; the reasons for the failed packages so far look legit to me; nothing that's unexpected to me. There are no "unable to rename temporary" errors yet, fortunately. But it's just been running for six hours now.
Comment 110 Nils Beyer 2017-07-24 22:39:02 UTC
(In reply to rozhuk.im from comment #106)

I've often read that increasing SoC voltage helps with things. So I ask the mainboard vendors: if increasing that voltage is so freaking important, why isn't it increased by default?

Anyway, have you already tried a poudriere build of all ports in order to force a crash?
Comment 111 Nils Beyer 2017-07-24 22:40:03 UTC
(In reply to Don Lewis from comment #107)

are you able to read temperatures using "mbmon" or my LUA script using "superiotool"?
Comment 112 Nils Beyer 2017-07-24 22:52:31 UTC
(In reply to Don Lewis from comment #108)

patch applied - load on all cores (incl. SMT ones) is 100%. But it seems that the loops never finish - they seem to be stuck:
-------------------------------------------------------------------------------
22080  1  I+    12:35.69 ./ryzen_segv_test 2500000
22083  1  I+    13:39.35 ./ryzen_segv_test 2500000
22084  1  I+    14:40.53 ./ryzen_segv_test 2500000
22087  1  I+    16:57.21 ./ryzen_segv_test 2500000
22090  1  I+    10:55.12 ./ryzen_segv_test 2500000
22091  1  I+    11:33.34 ./ryzen_segv_test 2500000
22092  1  I+    13:40.01 ./ryzen_segv_test 2500000
22093  1  I+    14:33.68 ./ryzen_segv_test 2500000
22094  1  I+    13:00.82 ./ryzen_segv_test 2500000
22095  1  I+    12:59.07 ./ryzen_segv_test 2500000
22096  1  I+    10:29.87 ./ryzen_segv_test 2500000
22097  1  I+    10:55.81 ./ryzen_segv_test 2500000
22098  1  I+    11:33.01 ./ryzen_segv_test 2500000
22099  1  I+    10:55.00 ./ryzen_segv_test 2500000
22100  1  I+    12:36.83 ./ryzen_segv_test 2500000
22101  1  I+    10:30.47 ./ryzen_segv_test 2500000
-------------------------------------------------------------------------------
Comment 113 Don Lewis freebsd_committer freebsd_triage 2017-07-24 23:19:27 UTC
(In reply to Nils Beyer from comment #112)
Hmm ... what I thought was a bug may not have been a bug after all ...

I missed the fact that main() initially sets the lock, so the first instance of threadx() should have properly initialized things after all.  With the init code removed, that instance of threadx() just spins on the lock instead of doing the init stuff and then unlocking things and letting the other threads run.

In this code in main():
         atomic_store(&flg, 1);
         atomic_store(&locked, 1);
change the second line so that locked is initialized to zero.
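
To illustrate why the initial value matters, here is a minimal sketch, assuming a lock where 1 means "held" and 0 means "free". The names lock_enter_bounded() and lock_leave() are hypothetical and are not taken from ryzen_segv_test; the spin limit exists only so the demonstration terminates.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the test's lock: 1 means held, 0 means free. */
static atomic_int locked;

/* Try to take the lock, giving up after max_spins attempts so that the
 * demonstration terminates instead of spinning forever. */
static bool lock_enter_bounded(long max_spins)
{
	for (long i = 0; i < max_spins; i++) {
		int expected = 0;

		/* Succeeds only if the lock was free; it is then marked held. */
		if (atomic_compare_exchange_strong(&locked, &expected, 1))
			return true;
	}
	return false;	/* still held, and nobody released it */
}

/* Release the lock so that another thread can enter. */
static void lock_leave(void)
{
	atomic_store(&locked, 0);
}
```

If locked starts at 1 and no thread ever calls lock_leave(), every caller of lock_enter_bounded() fails, which would match all threads spinning at 100% load without making progress.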
Comment 114 Don Lewis freebsd_committer freebsd_triage 2017-07-24 23:24:48 UTC
(In reply to Nils Beyer from comment #111)
# mbmon
No Hardware Monitor found!!
InitMBInfo: No error: 0

# superiotool -de
superiotool r4.0-2827-g1a00cf0
No Super I/O found
Comment 115 Nils Beyer 2017-07-24 23:29:52 UTC
(In reply to Don Lewis from comment #113)

done:
-----------------------------------------------------------
-       atomic_store(&locked, 1);
+       atomic_store(&locked, 0);
-----------------------------------------------------------

result:
-----------------------------------------------------------
./run.sh 16 2500000
Segmentation fault
Segmentation fault
-----------------------------------------------------------
Comment 116 Nils Beyer 2017-07-24 23:34:59 UTC
(In reply to Don Lewis from comment #114)

damn, sorry, out of ideas here - Gigabyte must be using some fancy sensor chip there...
Comment 117 Don Lewis freebsd_committer freebsd_triage 2017-07-24 23:41:06 UTC
(In reply to Nils Beyer from comment #109)
When I first switched to my current motherboard, I wanted to do some before and after comparisons of the stack guard changes.  I don't happen to remember what svn revision I was working with, but I was plagued by the rename bug.  I'm pretty sure it was somewhere between the ino64 changes and the stack guard changes.  It bit me once on a buildworld/buildkernel with /usr/obj on ZFS.  It broke quite a few poudriere runs where I use tmpfs for most things.  I also saw failures when I ran poudriere with tmpfs disabled.  For me it wasn't confined to tmpfs, but it *seemed* to happen more often with tmpfs.  It was bad enough that it hid a bunch of the other stability issues.

After several days, I upgraded to a different svn revision and the problem went away.  I'm currently on r320570 and haven't seen the problem in weeks.
Comment 118 SF 2017-07-25 02:14:53 UTC
None of us can measure the temperature of the CPU's voltage regulation. I talked about this before, and I have read about people who managed to stabilize their systems by adding extra cooling to it. None of you has tried this yet. Windows people say it's overheating, which also matches my experience; I didn't know this at the time I found out.
Comment 119 Don Lewis freebsd_committer freebsd_triage 2017-07-25 07:05:58 UTC
(In reply to Nils Beyer from comment #115)
I have some suspicions about what might be going wrong with ryzen_segv_test, but I really don't understand the memory fence / serialization stuff well enough to be sure.

An experiment that I performed to try to get a better idea of where things might be going off the rails was to add this code:
                if (func_set->func[func_set->offset] != 0x8b) {
                        fprintf(stderr, "First opcode should be 0x8b, but found 0x%x\n", func_set->func[func_set->offset]);
                }
to thread1() in between the
                pf = (func_t)(&func_set->func[ func_set->offset ]);
and
                ret2 = pf(func_set);
to verify that the expected opcode was actually where we plan to jump to.  What was interesting is that the error never triggered, *but* the frequency of segfaults went way down.

That led me to look at what the mfence instruction actually does:
    Acts as a barrier to force strong memory ordering (serialization) between
    load and store instructions preceding the MFENCE, and load and store
    instructions that follow the MFENCE. A weakly-ordered memory system
    allows the hardware to reorder reads and writes between the processor
    and memory. The MFENCE instruction guarantees that the system completes
    all previous memory accesses before executing subsequent accesses.

    The MFENCE instruction is weakly-ordered with respect to data and
    instruction prefetches.
Note the last sentence!

The mfence() at the end of the threadx() loop should flush all the pending writes out to cache associated with core A before the lock is unlocked and thread1() is permitted to do its work.

It looks to me like the mfence() at the top of the thread1() loop would not do anything since there aren't any interesting loads or stores that might be outstanding for core B at that point.  One would think that serialize() aka the cpuid instruction would prevent instruction prefetching of old stale data before that point, but maybe not.  I just don't understand this well enough ...

Next I tried moving
                mfence();
                serialize();
from just after lock_enter*() to just before the call into the newly moved
data array:
                ret2 = pf(func_set);
This gives mfence() some memory loads to wait for, which allows the data to be migrated from the core A cache.  With this change, I no longer get any segfaults.

Ryzen bug?  Just more aggressive prefetching?  I don't know ...
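
The ordering that ended up working for me can be sketched roughly as follows. This is only an illustration under assumptions, not the code from ryzen_segv_test: prepare_to_execute() and fence_and_serialize() are hypothetical names, the inline assembly assumes GCC or Clang on x86-64, and on other targets a C11 fence is used as a stand-in that does not serialize instruction fetch.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Full memory fence followed by a serializing instruction.  On x86-64 this
 * is mfence + cpuid; elsewhere a C11 fence is only a rough stand-in. */
static void fence_and_serialize(void)
{
#if defined(__x86_64__)
	unsigned int a = 0, b, c = 0, d;

	__asm__ __volatile__("mfence" ::: "memory");
	__asm__ __volatile__("cpuid"
	    : "+a"(a), "=b"(b), "+c"(c), "=d"(d) : : "memory");
	(void)a; (void)b; (void)c; (void)d;
#else
	atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* Read the freshly written code bytes, then fence and serialize, and only
 * then check the first opcode.  Returns 0 if the buffer looks ready to be
 * called, -1 if it is empty or the first opcode is not the expected one. */
static int prepare_to_execute(const unsigned char *code, size_t len,
    unsigned char first_opcode)
{
	const volatile unsigned char *p = code;
	unsigned int sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += p[i];	/* gives the fence some loads to wait for */
	(void)sum;

	fence_and_serialize();

	if (len == 0 || p[0] != first_opcode)
		return -1;
	return 0;
}
```

The caller would jump into the buffer only when prepare_to_execute() returns 0. Whether the extra loads, the fence, or simply the added delay is what makes the difference on Ryzen is exactly the part I can't answer.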
Comment 120 Konstantin Belousov freebsd_committer freebsd_triage 2017-07-25 10:59:27 UTC
For <jemalloc>: jemalloc_arena.c:821: Failed assertion: nstime_compare(&decay->epoch, &time) <= 0" try the patch I posted at https://reviews.freebsd.org/D11728 .
Comment 121 Nils Beyer 2017-07-25 13:30:58 UTC
(In reply to Don Lewis from comment #119)

> This gives mfence() some memory loads to wait for, which allows the data to be migrated from the core A cache.  With this change, I no longer get any segfaults.

confirmed - with that change, I haven't gotten any segfaults in 500 passes. Though, there is a discrepancy in how many passes each core has completed:
---------------------------------------------------------------------------
[...]
412: Tue Jul 25 15:19:00 CEST 2017: OK
405: Tue Jul 25 15:19:01 CEST 2017: OK
402: Tue Jul 25 15:19:01 CEST 2017: OK
420: Tue Jul 25 15:19:01 CEST 2017: OK
410: Tue Jul 25 15:19:01 CEST 2017: OK
406: Tue Jul 25 15:19:01 CEST 2017: OK
410: Tue Jul 25 15:19:01 CEST 2017: OK
414: Tue Jul 25 15:19:01 CEST 2017: OK
410: Tue Jul 25 15:19:01 CEST 2017: OK
409: Tue Jul 25 15:19:02 CEST 2017: OK
413: Tue Jul 25 15:19:02 CEST 2017: OK
423: Tue Jul 25 15:19:02 CEST 2017: OK
397: Tue Jul 25 15:19:02 CEST 2017: OK
411: Tue Jul 25 15:19:02 CEST 2017: OK
401: Tue Jul 25 15:19:02 CEST 2017: OK
421: Tue Jul 25 15:19:02 CEST 2017: OK
438: Tue Jul 25 15:19:02 CEST 2017: OK
427: Tue Jul 25 15:19:02 CEST 2017: OK
406: Tue Jul 25 15:19:02 CEST 2017: OK
---------------------------------------------------------------------------

In my eyes, each core is performing the same workload and should therefore be at the same pass number. Maybe I'm completely wrong. Is that something you've observed, too?


> Ryzen bug?  Just more aggressive prefetching?  I don't know ...

It's a rather difficult question: if CPU A executes something without segfaults, and CPU B throws segfaults using the same executable, does that automatically mean that CPU B is doing it wrong? Or does it rather mean that CPU B is not 100% compatible with CPU A and therefore needs an appropriate executable?

I ask because I wonder if that's something that should be told to AMD tech support - particularly because I have an open ticket there...
Comment 122 Don Lewis freebsd_committer freebsd_triage 2017-07-25 16:23:42 UTC
(In reply to Konstantin Belousov from comment #120)

I haven't seen this particular problem since I relocated the shared page.  Not to say that the problem is gone, but it has always been very sporadic.  That would also make it difficult to verify a fix.

I just kicked off another full port build run and will be on the lookout for any failures.
Comment 123 SF 2017-07-25 16:34:16 UTC
And that's why I think it is not a software problem: it is too random. I have never reproduced it the same way twice. It's always been different.
Comment 124 Don Lewis freebsd_committer freebsd_triage 2017-07-25 17:05:21 UTC
(In reply to Nils Beyer from comment #121)
I let it run overnight here and got 16000+ passes w/o error.

I also see the same variation:

16680: Tue Jul 25 08:04:36 PDT 2017: OK
16640: Tue Jul 25 08:04:36 PDT 2017: OK
16678: Tue Jul 25 08:04:36 PDT 2017: OK
16699: Tue Jul 25 08:04:37 PDT 2017: OK
16719: Tue Jul 25 08:04:37 PDT 2017: OK
16813: Tue Jul 25 08:04:37 PDT 2017: OK
16684: Tue Jul 25 08:04:37 PDT 2017: OK
16687: Tue Jul 25 08:04:37 PDT 2017: OK
16737: Tue Jul 25 08:04:37 PDT 2017: OK
16758: Tue Jul 25 08:04:37 PDT 2017: OK

This isn't too surprising since there are more threads than cores and the scheduler won't be totally fair about keeping the load on each core balanced, so the wall clock time for each process will vary a bit.  Over time there will be some dispersion in the number of processes executed by each run1 instance.

I don't know whether the segfaults in this example count as a bug or not.  The architecture spec should say that for this sort of thing you should do A, B, and C.  It may be the case that if you don't strictly follow the spec that your code will run on CPU A, but not CPU B.

I forgot to mention the uop cache.  I'm wondering if it automatically gets invalidated when writes are detected to the instruction locations that it has cached decoded instructions for.  Note this statement about self-modifying code:
  The micro-op cache is filled by the conventional instruction-fetch-and-decode
  pipeline, but it’s neither inclusive nor exclusive of the L1 instruction
  cache. As a result, self-modifying code is more difficult, as it must check
  and potentially invalidate both caches. Since the TLBs are earlier in the
  pipeline, the micro-op cache may be physically addressed, unlike Intel’s
  virtually addressed micro-op cache.
that I found here:
  http://www.neogaf.com/forum/showthread.php?t=1342455&page=1
I have read that AMD has been suggesting that people having stability problems try disabling the uop cache.  The BIOS on my board does not have an option for that.

I think this code is trying to test for the ASLR problem that a lot of Linux users have run into.  It's a poor match for that, though.  ASLR doesn't use self-modifying code; it always starts with a fresh process each time and just maps stuff into randomly chosen locations each time.  If you run the same program several times, the memory contents might look like
   A   B              C
      B     A                C
etc.  To the CPU this shouldn't be any different than running cat, make, and sh.
Comment 125 SF 2017-07-25 17:31:59 UTC
Dude, that's ridiculous. You have no clue what you are doing. If you really want to find out what's going on, then you need some debugging at the asm level and the specifications of this CPU. Then you need to know exactly how the faulty software interacts with the CPU to understand what the problem is.

All you are doing is scratching a bit at the top level without understanding exactly what's going on.

We all know this happens under high load. That more likely indicates a hardware problem than something on the software side, because a software bug would be much more likely to produce the same failures even under low load.
Comment 126 Don Lewis freebsd_committer freebsd_triage 2017-07-25 19:13:11 UTC
(In reply to Nils Beyer from comment #109)
Hey, I just got this during my latest ports build run, this time during the build of gstreamer1:

error: unable to rename temporary '.libs/libgstreamer_1.0_la-gsttoc.o-9edd16d6'
to output file '.libs/libgstreamer_1.0_la-gsttoc.o': 'No such file or directory'
1 error generated.

It's baack ...
Comment 127 Nils Beyer 2017-07-25 19:55:49 UTC
poudriere is still building after 27 hours - but it has done that before; so I'm not confident, yet.

50 ports failed; some because of C/C++ errors; some - especially the Go-based ports - due to bus errors and the like. This run started with only 18332 ports because the previous run had already built some before I applied the one-line patch.

No "unable to rename temporary" errors, yet.

If someone wants to look at the error logs, I've attached them here...
Comment 128 Nils Beyer 2017-07-25 20:01:44 UTC
Created attachment 184707 [details]
logs of failed poudriere builds
Comment 129 Nils Beyer 2017-07-25 20:11:52 UTC
(In reply to Don Lewis from comment #126)

well, I'd say "welcome back" if it's an ever-welcome guest; but no.

Perhaps one shouldn't pay so much attention to that error. It may not be an indicator of an impending freeze/reboot...
Comment 130 Nils Beyer 2017-07-25 20:14:02 UTC
At least, there's more activity in the AMD forum's thread. One guy has opened a bug report in the Linux kernel bugtracker:

    https://bugzilla.kernel.org/show_bug.cgi?id=196481
Comment 131 Don Lewis freebsd_committer freebsd_triage 2017-07-25 22:51:59 UTC
(In reply to Don Lewis from comment #126)
A total of three ports failed this time around, gstreamer1, thunderbird, and openoffice-devel.  All were the rename failure.  The total build time was almost six hours for ~1500 ports.  The failure of the gstreamer1 port caused a bunch of other ports to be skipped.

The only go ports that I build are lang/go14 and lang/go.  I've seen intermittent failures of the latter.  The usual manifestation is that go calls its panic() routine due to what looks like corruption in its malloc() data structures.  It is heavily threaded and I haven't seen any problems with it since I relocated the shared page.
Comment 132 Nils Beyer 2017-07-25 22:57:06 UTC
poudriere finished with 56 failed builds. No system freezes or reboots so far. *thumbsup*

I've started poudriere again to try the failed builds again; for example: "gcc5-devel" now built successfully, whereas before it generated a "bus error".

Anyways, current goal is to provoke a system crash...
Comment 133 Nils Beyer 2017-07-25 23:09:58 UTC
(In reply to Nils Beyer from comment #132)

you know what? Forget that. Because 11.1-RELEASE is out, I'll start a completely new build of all ports. And I'll reboot my system beforehand...
Comment 134 Don Lewis freebsd_committer freebsd_triage 2017-07-25 23:10:45 UTC
(In reply to Don Lewis from comment #124)
The AMD documentation that I've found is pretty much silent about cross-modifying code.  My interpretation of the pseudo-code that I found in the Intel documentation is that nothing other than the lock followed by CPUID is necessary.  They don't mention anything about the need for fence instructions.

Is this a Ryzen bug?  Yes, it looks like one to me.  Do I care?  Probably not.  I don't need to run contrived cross-modifying code.  I don't think this affects FreeBSD.  I suspect that anyplace that we do something like this there is likely to be some code that changes the page permissions to remove write access and add execute access between where the code is written and where it is executed.  That is probably sufficient distance to flush out the inconsistent state from the CPU before it tries to execute the code.  If this was not the case, I would expect that FreeBSD would hardly run at all on Ryzen.

In any case, I think there is a software workaround.  Just do a read access of the newly written instructions before the CPUID.  Maybe a fence instruction is needed as well.

I'm surprised that this problem seems to be worse with SMT.  I would think that cross-modifying between threads on the same core would be the best case and running on cores belonging to different CCXes would be the worst case ...
Comment 135 Ivan Rozhuk 2017-07-25 23:11:57 UTC
(In reply to Nils Beyer from comment #110)

No.
I build all my systems without jail and I do not use poudriere at all.

I have never seen a build fail and then magically succeed on a rebuild.

My mobo is an ASRock Taichi and I get a boot loop with code '00' after 3-8 days of uptime.
Computers from other vendors just reboot.
IMHO the boot loop happens because some hw error may be stored in the MCE registers and is not cleared by the soft reboot; the ASRock BIOS does not clear it either, but checks it at the end of init stage '00'. It sees the error and does a soft reset again and again,
until I power off or press the reset button.

Does the FreeBSD MCE implementation work properly?
Comment 136 Nils Beyer 2017-07-25 23:15:54 UTC
AMD support answered; they asked me to do a complete CMOS reset by detaching all cables from the computer and removing the CMOS battery. I should leave it like that for five minutes, put everything back in, and raise my VCORE voltage to 1.3625 V, leaving the rest on AUTO.

*sigh*
Comment 137 Ivan Rozhuk 2017-07-25 23:21:08 UTC
(In reply to Nils Beyer from comment #136)

They told a guy on a Russian Linux forum to do the same.
Next, they ask for photos of the cooling system...
Comment 138 Nils Beyer 2017-07-25 23:28:09 UTC
(In reply to Don Lewis from comment #134)

thanks for the explanations. So, that specific Ryzen bug is not responsible for the strange compilation failures (segfaults/bus errors/unable to rename), correct?

What do you suggest trying in order to force these failures within a timeframe of minutes, then?
Comment 139 Nils Beyer 2017-07-25 23:34:36 UTC
(In reply to rozhuk.im from comment #135)

okay, the Taichi is one of the top-notch AM4 mainboards; 12+4 CPU phases - should be plenty.


> Does the FreeBSD MCE implementation work properly?

at least it logs any MCE messages in the kernel - but only live; if your board catches the error before the kernel can, I think you're out of luck here.

Maybe you can try Don's one-liner patch which I've attached and rebuild your kernel...
Comment 140 Nils Beyer 2017-07-25 23:35:14 UTC
Created attachment 184714 [details]
Don's Ryzen patch - stripped to the SHAREDPAGE define only...
Comment 141 Nils Beyer 2017-07-25 23:39:03 UTC
(In reply to rozhuk.im from comment #137)

I wonder if these support guys ever read the URLs people tell them - probably not.

No matter, I'm not doing as support requested; I'd like to let poudriere build things first and see if the SHAREDPAGE shifting fixes these freezes/reboots...
Comment 142 Ivan Rozhuk 2017-07-25 23:52:20 UTC
IMHO it can fix soft fails, but not reboots.
Comment 143 Don Lewis freebsd_committer freebsd_triage 2017-07-25 23:53:17 UTC
(In reply to Nils Beyer from comment #138)
I seriously doubt it.

You could try writing a script that writes to and then renames files and run multiple instances in parallel to emulate what a parallel build looks like.  Then periodically clean the directory.
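
A minimal sketch of such a test, written in C rather than as a shell script, might look like this. Everything in it is an assumption for illustration: the file naming only mimics the compiler's "foo.o-xxxxxxxx" temporaries, and write_and_rename() and run_workers() are made-up names.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write a small "object file" under a temporary name, then rename it into
 * place, as the compiler does.  Returns 0 on success, -1 on any failure. */
static int write_and_rename(const char *dir, int worker, int iter)
{
	char tmp[512], out[512];
	FILE *fp;

	snprintf(out, sizeof(out), "%s/w%d.o", dir, worker);
	snprintf(tmp, sizeof(tmp), "%s/w%d.o-%08x", dir, worker, (unsigned)iter);

	fp = fopen(tmp, "w");
	if (fp == NULL) {
		perror(tmp);
		return -1;
	}
	fprintf(fp, "worker %d iteration %d\n", worker, iter);
	if (fclose(fp) != 0) {
		perror(tmp);
		return -1;
	}
	if (rename(tmp, out) != 0) {
		perror("rename");	/* the failure we are hunting for */
		return -1;
	}
	/* Periodically clean up, like a build wiping its object directory. */
	if (iter % 16 == 15 && unlink(out) != 0) {
		perror(out);
		return -1;
	}
	return 0;
}

/* Fork nworkers processes that each do iters write+rename cycles in dir.
 * Returns the number of workers that hit an error. */
static int run_workers(const char *dir, int nworkers, int iters)
{
	int failed = 0, status;

	for (int w = 0; w < nworkers; w++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			failed++;
		} else if (pid == 0) {
			for (int i = 0; i < iters; i++)
				if (write_and_rename(dir, w, i) != 0)
					_exit(1);
			_exit(0);
		}
	}
	/* Reap every child and count the ones that did not exit cleanly. */
	while (wait(&status) > 0)
		if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
			failed++;
	return failed;
}
```

Calling run_workers() from main() with a directory on tmpfs (or ZFS) and something like 16 workers would approximate a parallel build; a nonzero return value means at least one rename or write failed.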
Comment 144 Don Lewis freebsd_committer freebsd_triage 2017-07-25 23:54:45 UTC
(In reply to Nils Beyer from comment #136)
They left out the most important part ... shaking a rubber chicken over it.
Comment 145 Don Lewis freebsd_committer freebsd_triage 2017-07-26 00:03:58 UTC
(In reply to rozhuk.im from comment #135)
It's somewhat unknown here how "fine" the machine check code works, but I have seen reports of errors getting logged.  It is sometimes possible for software to provoke these errors, but that is unusual.  Correctable errors should be fairly benign.  If you are getting uncorrectable errors, that will probably cause the machine to reboot.  In either case it is likely to be a hardware issue of some sort.

I've gotten one correctable error on my machine in the last week:

Jul 22 17:17:17 speedy kernel: MCA: Bank 1, Status 0x90200000000b0151
Jul 22 17:17:17 speedy kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
Jul 22 17:17:17 speedy kernel: MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 14
Jul 22 17:17:17 speedy kernel: MCA: CPU 14 COR ICACHE L1 IRD error

This looks like a correctable level 1 instruction cache read error.  It doesn't seem to have had any side effects.
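
For anyone who wants to decode such a status word by hand, here is a rough sketch. It assumes the generic x86 machine-check layout (VAL in bit 63, UC in bit 61, and a memory-hierarchy error code of the form 0000 0001 RRRR TTLL in the low 16 bits); the function names are made up, and this is not the kernel's decoder.

```c
#include <stdint.h>
#include <stdio.h>

/* Selected MCi_STATUS flag bits from the x86 machine-check architecture. */
#define MC_STATUS_VAL	(1ULL << 63)	/* register holds a valid error */
#define MC_STATUS_UC	(1ULL << 61)	/* error was NOT corrected */

/* Returns 1 for a valid error that the hardware corrected, 0 otherwise. */
static int mca_is_corrected(uint64_t status)
{
	return (status & MC_STATUS_VAL) != 0 && (status & MC_STATUS_UC) == 0;
}

/* Decode a memory-hierarchy error code into buf.  Returns 0 on success,
 * -1 if the low 16 bits are not in the 0000 0001 RRRR TTLL format. */
static int mca_describe_cache_error(uint64_t status, char *buf, size_t len)
{
	static const char *const rrrr[] = {
		"ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT",
		"SNOOP"
	};
	static const char *const tt[] = { "I", "D", "G", "?" };
	static const char *const ll[] = { "L0", "L1", "L2", "LG" };
	unsigned int code = (unsigned int)(status & 0xffff);
	unsigned int r = (code >> 4) & 0xf;

	/* Bit 12 is ignored here; bits 15-13 and 11-8 must read 000 and 0001. */
	if ((code & 0xef00) != 0x0100)
		return -1;
	snprintf(buf, len, "%sCACHE %s %s", tt[(code >> 2) & 3], ll[code & 3],
	    r < 9 ? rrrr[r] : "???");
	return 0;
}
```

For the status above, 0x90200000000b0151, the top byte 0x90 has VAL set and UC clear, and the low 16 bits 0x0151 split into RRRR = 0101 (IRD), TT = 00 (instruction) and LL = 01 (L1), which agrees with the "COR ICACHE L1 IRD" line the kernel printed.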


What error are you getting?
Comment 146 Nils Beyer 2017-07-26 07:20:59 UTC
(In reply to Nils Beyer from comment #133)

poudriere has been running for 8 hours now - 4341 built, 4 failed builds, 2 of them due to "unable to rename temporary":
---------------------------------------------------------------------------------
/usr/local/poudriere/data/logs/bulk/11_1-default/latest/logs/errors/webkit2-gtk3-2.16.6.log:error: unable to rename temporary 'Source/WebCore/CMakeFiles/WebCoreDerivedSources.dir/__/__/DerivedSources/WebCore/JSComment.cpp.o-f48c4f6d' to output file 'Source/WebCore/CMakeFiles/WebCoreDerivedSources.dir/__/__/DerivedSources/WebCore/JSComment.cpp.o': 'No such file or directory'
/usr/local/poudriere/data/logs/bulk/11_1-default/latest/logs/errors/kf5-kservice-5.36.0.log:error: unable to rename temporary 'src/CMakeFiles/KF5Service.dir/services/kservice.cpp.o-7cd16444' to output file 'src/CMakeFiles/KF5Service.dir/services/kservice.cpp.o': 'No such file or directory'
---------------------------------------------------------------------------------

I'm ignoring that - goal is a crash...
Comment 147 Nils Beyer 2017-07-26 13:32:14 UTC
(In reply to Nils Beyer from comment #146)

still running for 14 hours now - no freezes/reboots. Built: 8015 - Failed: 9

Apart from the known SIGBUS Java error, I got a strange signal 5 (SIGTRAP) building "chromium":
-------------------------------------------------------------------------------
[0726/093727.533851:FATAL:ref_counted.cc(26)] Check failed: in_dtor_. RefCountedThreadSafe object deleted without calling Release()
-------------------------------------------------------------------------------

But that's not the topic for me now.

Would it make sense if I create a new ticket for all the strange compilation errors (segfaults/bus errors/unable to rename) in which a second compilation immediately afterwards succeeds?

So this ticket here is about the system freeze/reboot problem only...
Comment 148 Nils Beyer 2017-07-26 16:22:19 UTC
(In reply to Nils Beyer from comment #136)

just for the record, that's the response I've sent to AMD support:
------------------------------------------------------------------------------------------------
Dear ...,

thank you for your response. At the moment, I'm testing an unofficial FreeBSD kernel patch as a possible workaround for the system freezes and system reboots. This workaround is somewhat related to what Mr. Matthew Dillon reported to you in April:

    "There appears to be an issue with the kernel iretq'ing to a %rip near the end of the user address space (top of stack)."

After my current stress test is finished, I'll try what you've suggested. But that could delay up to 48 hours - so please be patient with me...
------------------------------------------------------------------------------------------------

I hope that they've caught my emphasis on "workaround"...
Comment 149 Don Lewis freebsd_committer freebsd_triage 2017-07-26 17:10:11 UTC
(In reply to Nils Beyer from comment #147)
Yes, opening a new ticket would make sense.

So far I have not been able to reproduce the rename failure with a synthetic test.

To try to determine if it is limited to tmpfs, I've been running poudriere with tmpfs disabled.  I got a clean run yesterday.  I started it again last night and sometime during the night gmime26 failed to build properly.  No obvious errors.  It looks like the compiler just returned a non-zero exit status.  I've seen this before.  This run should be finished in about an hour.

Too bad you're not running ECC RAM, that would eliminate one potential silent cause of strange behavior.
Comment 150 Nils Beyer 2017-07-26 17:31:47 UTC
(In reply to Don Lewis from comment #149)

> Yes, opening a new ticket would make sense.

done:

    https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221029

my response to your comment is there...
Comment 151 Mark Millard 2017-07-26 18:49:11 UTC
(In reply to Nils Beyer from comment #150)

Which bugzilla report does the:

---------------------------------------------------------------------------
MCA: Bank 1, Status 0x90200000000b0151
MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 14
MCA: CPU 14 COR ICACHE L1 IRD error
---------------------------------------------------------------------------

type of report go against? Or have these been
too infrequent to associate with anything that
cuts the frequency of them? Have the problems
gone away (if that can be judged)?

Should there be a bugzilla report covering
getting these instead of just leaving the
material in a large accumulation of other
material?
Comment 152 Nils Beyer 2017-07-26 18:59:30 UTC
(In reply to Mark Millard from comment #151)

yes, sorry; completely forgot these MCA messages. These occur during compilation orgies. Infrequent, but they happen. So I've put some examples into my compilation bug report.

I don't think that this needs to go into a separate bug report, but I cannot judge that for sure. If you say so, I'll open a third report...
Comment 153 Nils Beyer 2017-07-26 19:00:08 UTC
(In reply to Nils Beyer from comment #152)

they still happen - even with Don's patch...
Comment 154 Mark Millard 2017-07-26 19:08:20 UTC
(In reply to Don Lewis from comment #149)

It is my understanding that many Ryzen 7
boards allow ECC memory but do not use the ECC
functionality: either the board is not wired
or else the BIOS does not configure for it.

For example:

7A32v1.0(G52-7A321X1)(X370 GAMING PRO CARBON).pdf

says:

Supports ECC UDIMM memory (non-ECC mode)

(It does not say if the board is wired to allow
ECC or not so I've no clue if a BIOS update
might be able to enable support of ECC mode.)

So it appears that:

"Too bad you're not running ECC RAM, that would eliminate one potential silent cause of strange behavior"

needs more context, such as "on a board that
uses ECC mode (when appropriately configured
if needed)".

It may be that most folks would not have ECC
ram put to use for their boards.

(The X370 GAMING PRO CARBON that I had access
to was turned into a brick by a BIOS update
attempt and I do not yet have access to another
Ryzen system.)
Comment 155 Mark Millard 2017-07-26 19:13:27 UTC
(In reply to Nils Beyer from comment #153)

The MCA reports probably should be mentioned
in interactions with AMD, including that the
change that avoids panics does not eliminate
the MCA reports.
Comment 156 Nils Beyer 2017-07-26 19:14:20 UTC
(In reply to Mark Millard from comment #154)

wouldn't "dmidecode" tell the truth (from my 16GB non-poudriere system):
-------------------------------------------------------------------------------
#dmidecode -t memory
# dmidecode 3.1
Scanning /dev/mem for entry point.
SMBIOS 3.0.0 present.

Handle 0x0032, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 64 GB
        Error Information Handle: 0x0031
        Number Of Devices: 4
[...]
Handle 0x003B, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0032
        Error Information Handle: 0x003A
        Total Width: 128 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMM_A2
        Bank Locator: BANK 1
        Type: DDR4
        Type Detail: Synchronous Unbuffered (Unregistered)
        Speed: 2133 MT/s
        Manufacturer: Samsung
        Serial Number: 359EF4B0
        Asset Tag: Not Specified
        Part Number: M391A2K43BB1-CPB    
        Rank: 2
        Configured Clock Speed: 1067 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
-------------------------------------------------------------------------------
?
Comment 157 Nils Beyer 2017-07-26 19:16:50 UTC
(In reply to Mark Millard from comment #155)

understood, as soon as I'm certain that my system doesn't freeze/reboot anymore, I'll tell AMD support exactly as you suggest.

BTW:
--------------------------------------------------------------------------------
[11_1-default] [2017-07-26_01h13m51s] [parallel_build:] Queued: 27317 Built: 11894 Failed: 21    Skipped: 899   Ignored: 78    Tobuild: 14425  Time: 20:02:11
--------------------------------------------------------------------------------

and still running...
Comment 158 Mark Millard 2017-07-26 19:20:37 UTC
(In reply to Nils Beyer from comment #150)

For the problem:

"<jemalloc>: jemalloc_arena.c:821: Failed assertion: nstime_compare(&decay->epoch, &time) <= 0"

which bugzilla report should be used? Should it
have a new, separate one?

This is an issue where Konstantin Belousov had
reported:

try the patch I posted at https://reviews.freebsd.org/D11728 .

So it might be considered to be on a separate
investigative line than the rest of the issues.

As I understand it, this is not a source of panics, so
bugzilla 219399 would not seem a good fit.
Comment 159 Mark Millard 2017-07-26 19:33:09 UTC
(In reply to Nils Beyer from comment #156)

Since the Ryzen itself supports ECC memory
I do not know if that would automatically
cause a report that indicates ECC support
overall but not DIMM by DIMM when ECC
mode is not supported.

In what you included, the section:

Handle 0x003B, DMI type 17, 40 bytes
Memory Device

does not show the DIMM as ECC in any way
as far as I can tell.

(I am currently without access to a Ryzen
system. The original had a failed BIOS
upgrade attempt so I can not see what
would be reported for it.)
Comment 160 Don Lewis freebsd_committer freebsd_triage 2017-07-26 19:38:44 UTC
(In reply to Mark Millard from comment #151)
That's a correctable level 1 instruction cache error.  The hardware should automagically correct it and it should not affect proper operation.  No bug report is necessary.  If your logs are getting spammed with these then you might want to ask AMD about a replacement, otherwise it's not something to be concerned about.
Comment 161 Don Lewis freebsd_committer freebsd_triage 2017-07-26 19:47:13 UTC
(In reply to Nils Beyer from comment #153)
I suspect that there is a pattern sensitivity issue in the icache since I've seen quite a few reports of this.  Nothing that we do in software should affect this.  The same is true of any microcode fixes from AMD.  As long as the errors aren't too frequent (a hard stuck bit would cause very frequent errors) and they are always correctable, then this shouldn't be an issue.

If you happen to be in communication with AMD, you might want to tell them how frequently you see these.
Comment 162 Don Lewis freebsd_committer freebsd_triage 2017-07-26 20:00:05 UTC
(In reply to Mark Millard from comment #154)
Unfortunately that is true.  That is one of the reasons why I switched from my previous Gigabyte AB350 board to the AX370-GAMING 5 after spending more on RAM than CPU for this build.  The specs on the previous board were sufficiently ambiguous that I thought I could save $100 on a motherboard and still have working ECC.

Unlike the earlier AM2->AM3+ Gigabyte boards that I have used that had all sorts of BIOS knobs for ECC, the BIOS on this board has no mention of ECC at all.  It just silently enables it if it sees ECC RAM.  I confirmed that by booting Linux and with dmidecode.  Would be nice if the FreeBSD kernel mentioned it ...

Unfortunately, Ryzen support wasn't added to memtest86 until today.  The previous version just reported ECC unknown.
Comment 163 Don Lewis freebsd_committer freebsd_triage 2017-07-26 20:09:38 UTC
(In reply to Mark Millard from comment #159)
This is what dmidecode reports on my machine:
Handle 0x0027, DMI type 16, 23 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: Multi-bit ECC
	Maximum Capacity: 64 GB
	Error Information Handle: 0x0026
	Number Of Devices: 4

Handle 0x002E, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x0027
	Error Information Handle: 0x002D
	Total Width: 128 bits
	Data Width: 64 bits
	Size: 16384 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM 0
	Bank Locator: CHANNEL A
	Type: DDR4
	Type Detail: Synchronous Unbuffered (Unregistered)
	Speed: 2400 MHz
	Manufacturer: Micron Technology
	Serial Number: 14C07593
	Asset Tag: Not Specified
	Part Number: 18ASF2G72AZ-2G3B1   
	Rank: 2
	Configured Clock Speed: 2400 MHz
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
[snip]

Note the "Error Correction Type: Multi-bit ECC".  The individual DIMMs are a bit strange.  On my other machines the ECC bits are included in "Total Width".  I have visually verified the extra RAM chips for the ECC bits are present.

This is what I see on my FX-8320E machine:

Handle 0x002C, DMI type 16, 23 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: Multi-bit ECC
	Maximum Capacity: 32 GB
	Error Information Handle: Not Provided
	Number Of Devices: 4

Handle 0x002E, DMI type 17, 34 bytes
Memory Device
	Array Handle: 0x002C
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 8192 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM0
	Bank Locator: BANK0
	Type: DDR3
	Type Detail: Synchronous Unbuffered (Unregistered)
	Speed: 1600 MT/s
	Manufacturer: Micron       
	Serial Number: 0D43FEC
	Asset Tag: AssetTagNum0
	Part Number: 18JSF1G72AZ-1G6E1 
	Rank: 2
	Configured Clock Speed: 800 MT/s
Comment 164 Mark Millard 2017-07-26 20:17:12 UTC
(In reply to Don Lewis from comment #160)

Merely seeing detected (and corrected)
single(?)-bit errors on a fairly regular basis
(instead of very rare even on a very busy PC)
is possibly worrisome in itself:

Detected single bit errors suggest it is also
more likely that undetected multi-bit errors
also occur more often.

If I understand the observed frequency of
observation for some systems right I personally
would be more likely to let AMD know if I was
in contact.

(But I'm without access to a Ryzen system for
now due to an attempted BIOS update that
failed. At this point I'm just monitoring the
status of the various issues for possible future
use.)
Comment 165 Mark Millard 2017-07-26 20:30:58 UTC
(In reply to Don Lewis from comment #163)

Note: I can not claim to know for

In:

Physical Memory Array
. . .
	Error Correction Type: Multi-bit ECC

I wonder if it says that even with non-ECC RAM in
one or more (or all) DIMMs. That section is not
DIMM specific at all from what I can tell. But
maybe some of it is a summary over all DIMMs?
(My guess is: no.)

In:

Memory Device
. . .
	Error Information Handle: 0x002D
	Total Width: 128 bits
	Data Width: 64 bits

vs. the older, narrower context's

Memory Device
. . .
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits

The 128 bits instead of 128+8 or 128+2*8 bits
suggests lack of ECC use at the DIMM in
question.

But I'm unable to test a Ryzen context for now
so take my notes in this area as speculation
(or as a suggestion for contrasting tests that
I might have performed if the Ryzen system still
worked, presuming some non-ECC RAM is available
to substitute temporarily).
Comment 166 Don Lewis freebsd_committer freebsd_triage 2017-07-26 20:38:46 UTC
(In reply to Mark Millard from comment #165)
... and I don't have any non-ECC RAM that I could try.
Comment 167 Ivan Rozhuk 2017-07-26 21:34:52 UTC
(In reply to Mark Millard from comment #154)
MCE != ECC
MCE works even without ECC.
Comment 168 Mark Millard 2017-07-26 21:43:10 UTC
(In reply to Don Lewis from comment #163)

Table 73 ("Memory Device (Type 17) structure") of:

http://www.dmtf.org/sites/default/files/standards/documents/DSP0134_3.1.1.pdf

(System Management BIOS (SMBIOS) Reference Specification)

reports:

Total width, in bits, of this memory device, including
any check or error-correction bits. If there are no
error-correction bits, this value should be equal to
Data Width. If the width is unknown, the field is set
to FFFFh.

Also:

Data width, in bits, of this memory device. A Data Width
of 0 and a Total Width of 8 indicates that the device is
being used solely to provide 8 error-correction bits. If
the width is unknown, the field is set to FFFFh.

(It is indicated as applying to specification versions:
2.1+ .)



So for the 128 in:

Memory Device
. . .
	Error Information Handle: 0x002D
	Total Width: 128 bits
	Data Width: 64 bits

Total Width != Data Width is supposed to imply
error correction bits but I doubt that many.

It seems odd and is probably not according to the
specification. So it is not clear that the ECC
status can be inferred. The factor of 2 between
64 and 128 is more likely from something like
dual-channel or some such with any error correction
bits ignored if present/in-use.

This tends to suggest that interpreting the dmidecode
output for such things can be a problem to rely on.
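
To make the width comparison concrete, here is a minimal sketch that applies the quoted Type 17 rule to the two dmidecode excerpts from comment #163. The function names and the "8 check bits per 64 data bits" plausibility test are my own assumptions for illustration, not anything dmidecode or the SMBIOS specification provides:

```python
import re

def widths(memory_device_text):
    """Return (total, data, extra) widths in bits from one 'Memory Device' record."""
    total = int(re.search(r"Total Width:\s*(\d+) bits", memory_device_text).group(1))
    data = int(re.search(r"Data Width:\s*(\d+) bits", memory_device_text).group(1))
    return total, data, total - data

def looks_like_ecc(total, data):
    # Assumption: a conventional ECC DIMM carries 8 check bits per 64 data bits.
    return data > 0 and (total - data) * 8 == data

# Excerpts copied from the two machines in comment #163.
ryzen = "\tTotal Width: 128 bits\n\tData Width: 64 bits"
fx8320e = "\tTotal Width: 72 bits\n\tData Width: 64 bits"

for name, text in (("Ryzen", ryzen), ("FX-8320E", fx8320e)):
    total, data, extra = widths(text)
    print(name, "extra bits:", extra, "plausible ECC layout:", looks_like_ecc(total, data))
```

The FX-8320E record fits the usual 72/64 layout, while the Ryzen record claims 64 extra bits, which is why the ECC status cannot be read off that field.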
Comment 169 Ivan Rozhuk 2017-07-26 21:45:19 UTC
(In reply to Don Lewis from comment #162)

Gigabyte have only 3 mobo for AM4 socket with good VRM (I mean VRM can provide power for 95W CPU on long load cycles without additional cooling).
Gaming 5, Gaming k7 and may be AB350N-Gaming WiFi.
Gaming 5, Gaming k7 - have ECC support.


I tried this: https://reviews.freebsd.org/D9824?id=25815
on my ASRock Taichi with Samsung ECC memory - it says that I have no ECC memory in the system. But other people with the same hardware report that ECC is OK.
Comment 170 Nils Beyer 2017-07-26 21:57:10 UTC
(In reply to rozhuk.im from comment #169)

cool, that module is already included in 11.1-RELEASE (didn't know that). But "dmesg" says after kldloading on my second system (with 16GB ECC RAM):
-----------------------------------------------------------------------------
DRAM ECC is not supported or disabled
-----------------------------------------------------------------------------
:-(
Comment 171 Mark Millard 2017-07-26 21:59:53 UTC
(In reply to rozhuk.im from comment #167)

Machine Check Exceptions (MCEs) report various
errors that are detected in various ways.

For RAM: check or error correction bits (extra
bits) are one way of detecting RAM data
problems. (ECC is one form of error correction
bits.)

Without redundancy via the extra bits, MCE has
no source of information to report the specific
type of error for RAM and simply does not report
what is then not detected as a problem. In other
words: no machine check exception then happens.

See "Problem Types" in

https://en.wikipedia.org/wiki/Machine-check_exception

For a short mention of "Memory Errors" that mentions
how they are detected for Machine Check Exceptions.
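
A toy illustration of the point about redundancy, using a single even-parity bit. This is only a sketch: real ECC DIMMs use a stronger code than one parity bit, and the helper names are made up for the example. It shows that a stored check bit exposes a single flipped bit, that a double flip slips past this particular check, and that without the stored bit there is nothing to compare against at all:

```python
def parity(word):
    """Even-parity check bit for an integer word."""
    return bin(word).count("1") % 2

def error_detected(stored_check, read_word):
    # Detection is only possible because stored_check was saved next to the data.
    return parity(read_word) != stored_check

word = 0b10110010
check = parity(word)              # the "extra bit" kept alongside the data

one_flip = word ^ 0b00000100      # one bit corrupted
two_flips = word ^ 0b00010100     # two bits corrupted

print(error_detected(check, word))       # unchanged data
print(error_detected(check, one_flip))   # single-bit error
print(error_detected(check, two_flips))  # double-bit error, missed by plain parity
```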
Comment 172 Mark Millard 2017-07-26 22:08:17 UTC
(In reply to Don Lewis from comment #166)

Looks like you could try:

https://reviews.freebsd.org/D9824?id=25815

that rozhuk.im@gmail.com reported (thanks!).

Nils Beyer reports that for him and his system
with ECC memory in it (comment #170):

cool, that module is already included in 11.1-RELEASE (didn't know that). But "dmesg" says after kldloading on my second system (with 16GB ECC RAM):
-----------------------------------------------------------------------------
DRAM ECC is not supported or disabled
-----------------------------------------------------------------------------

So that would seem to be a way to confirm
or deny the ECC status when ECC memory is
present.
Comment 173 Mark Millard 2017-07-26 22:16:59 UTC
(In reply to Mark Millard from comment #172)

I should have also noted that Nils Beyer
has earlier reported for that "not
supported" board (with 16 GB):

Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 64 GB
        Error Information Handle: 0x0031
        Number Of Devices: 4

indicating that this dmidecode section is not
based on evaluating all the DIMMs.
Comment 174 Don Lewis freebsd_committer freebsd_triage 2017-07-26 22:52:41 UTC
(In reply to Nils Beyer from comment #170)
Hmn, same here.  Something to check is the AMD Family 17H docs to see if this module needs any changes.

Something else to try is the very latest memtest86 (released today).

Lacking either of the above, I did boot Linux and it reported that EDAC was available.

If you search the part number of the RAM in my dmidecode output, you'll find
this:
  https://www.micron.com/parts/modules/ddr4-sdram/mta18asf2g72hz-2g3?pc=%7BE1D8F1A9-3DFC-4BD2-8A1E-C26ED261EB0A%7D
though it's not in SODIMM format.  It's actually Crucial-branded.  This particular kit:
  http://www.crucial.com/usa/en/ct4k16g4wfd824a
and wow, the price went up a bunch since my purchase.  The kit consists of four sticks of this:
  http://www.crucial.com/usa/en/ct16g4wfd824a
which was listed on Gigabyte's QVL list at the time of my purchase.
Comment 175 Ivan Rozhuk 2017-07-26 23:04:54 UTC
(In reply to Nils Beyer from comment #170)

I open new bug: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221038


(In reply to Don Lewis from comment #174)
No docs! :(
https://community.amd.com/message/2800088#comment-2800088
http://support.amd.com/en-us/search/tech-docs
Comment 176 Mark Millard 2017-07-26 23:27:02 UTC
(In reply to Mark Millard from comment #172 and #173)

Looks like my comments #172 and #173 have
a different interpretation:

Just because the dmidecode reports something
does not mean that FreeBSD supports it (yet
or possibly completely).

In this case the comment #170:

-----------------------------------------------------------------------------
DRAM ECC is not supported or disabled
-----------------------------------------------------------------------------

need not indicate what the board and
BIOS are capable of --nor what the
BIOS might be configured for.

So https://reviews.freebsd.org/D9824?id=25815
currently reports different information than I
thought for the context in question.
Comment 177 Nils Beyer 2017-07-27 18:26:41 UTC
finally, finally, finally a complete poudriere run without any system freezes or unexpected reboots. :-)

Take a look for yourself:

   http://46.245.217.106:10080/build.html?mastername=11_1-default&build=2017-07-26_01h13m51s

26020 packages built within 43 hours; as flaky as the Ryzen may be, that thing is really fast (still running stock with slow RAM) - compare with:

    http://beefy9.nyi.freebsd.org/build.html?mastername=110amd64-default&build=439712

I don't know the specs of the FreeBSD "beefy" servers; but I suggest that they try a Threadripper once those are out.

Don, I'm quite certain that you nailed the bug that I created this report for. Thank you very, very much. And if it's really nailed and not a magic moment, then Matthew Dillon was right - all the time. He said that he sent a full test case to AMD in April. It's quite a shame that there is still no reaction from their side. I'm sure Matthew would have told us by now if they had reacted.

Anyways, I'll start a poudriere run again to see how many of the failed ports can be built then.

Don, is it possible to get your one-liner patch upstream, so that other FreeBSD Ryzen users may benefit from it?
Comment 178 Nils Beyer 2017-07-27 18:59:42 UTC
Ahh yes, got an MCA as well:
-------------------------------------------------------------------------------
Jul 27 10:08:38 asbach kernel: pid 64716 (conftest), uid 0: exited on signal 10
Jul 27 10:09:41 asbach kernel: MCA: Bank 1, Status 0x90200000000b0151
Jul 27 10:09:41 asbach kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
Jul 27 10:09:41 asbach kernel: MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 9
Jul 27 10:09:41 asbach kernel: MCA: CPU 9 COR ICACHE L1 IRD error
Jul 27 10:28:53 asbach kernel: pid 67290 (doxygen), uid 0: exited on signal 6
-------------------------------------------------------------------------------
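
For anyone who wants to check such a status word by hand, here is a rough sketch of how the fields in "Status 0x90200000000b0151" map onto the "COR ICACHE L1 IRD" text. It assumes the common x86 machine-check status layout (VAL/OVER/UC/EN in bits 63-60, the MCA error code in the low 16 bits) and only handles the memory-hierarchy error-code format; the function name and table names are hypothetical and this is not the kernel's own decoder:

```python
# Field tables for the memory-hierarchy error code format 000F 0001 RRRR TTLL.
TRANSACTION = ["ICACHE", "DCACHE", "GCACHE"]            # TT field
LEVEL = ["L0", "L1", "L2", "LG"]                        # LL field
REQUEST = ["ERR", "RD", "WR", "DRD", "DWR", "IRD",
           "PREFETCH", "EVICT", "SNOOP"]                # RRRR field

def decode_mca_status(status):
    """Decode the flag bits and, if applicable, a memory-hierarchy error code."""
    result = {
        "VAL": bool((status >> 63) & 1),   # status register holds valid data
        "OVER": bool((status >> 62) & 1),  # an earlier error was overwritten
        "UC": bool((status >> 61) & 1),    # uncorrected error
        "EN": bool((status >> 60) & 1),    # error reporting enabled
        "description": None,
    }
    code = status & 0xFFFF
    if (code & 0xEF00) == 0x0100:          # memory-hierarchy error format
        tt = (code >> 2) & 0x3
        ll = code & 0x3
        rrrr = (code >> 4) & 0xF
        if tt < len(TRANSACTION) and rrrr < len(REQUEST):
            severity = "UNCOR" if result["UC"] else "COR"
            result["description"] = " ".join(
                [severity, TRANSACTION[tt], LEVEL[ll], REQUEST[rrrr]])
    return result

decoded = decode_mca_status(0x90200000000b0151)
print(decoded)
```

With UC clear, the event is a corrected one, which matches the "COR" in the kernel message.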
Comment 179 Mark Millard 2017-07-27 19:05:57 UTC
(In reply to Nils Beyer from comment #177)

Not that I'll be able to experiment with such a
build any time soon but . . .

Could you post material about how to set up
and configure a repeat for others to try your
test?

In my case I'm not a poudriere user (other than
some failed cross-build experiments some time
ago). So it would not be a variation of an
existing poudriere configuration.
Comment 180 Nils Beyer 2017-07-27 19:19:37 UTC
(In reply to Mark Millard from comment #179)

sure, I'll try:
------------------------------------------------------------------------------
1) fresh FreeBSD 11.1-RELEASE install on ZFS-on-root; poolname=freeze; booted into installed system and logged in as "root"

2) pkg install poudriere-devel
2a) rehash

3) cat > /usr/local/etc/poudriere.conf <<EOF
ZPOOL=freeze
FREEBSD_HOST=ftp://ftp.freebsd.org
RESOLV_CONF=/etc/resolv.conf
BASEFS=/usr/local/poudriere
USE_PORTLINT=no
USE_TMPFS=yes
DISTFILES_CACHE=/usr/ports/distfiles
PARALLEL_JOBS=15
BUILD_AS_NON_ROOT=no
ALLOW_MAKE_JOBS_PACKAGES="pkg ccache py* gcc* llvm* ghc* *webkit* *office* chromium* iridium* mongodb*"
EOF

4) poudriere jail -c -j 11_1 -m ftp -v 11.1-RELEASE

5) poudriere ports -c

6) cd /root

7) /usr/bin/nohup poudriere bulk -j 11_1 -a &
------------------------------------------------------------------------------
Comment 181 Don Lewis freebsd_committer freebsd_triage 2017-07-27 20:06:55 UTC
(In reply to Nils Beyer from comment #177)
See comment #92

The final patch will be a bit more involved.
Comment 182 Nils Beyer 2017-07-28 11:19:55 UTC
In order to track these compilation errors, I did what AMD support requested: cleared CMOS by removing all cables and the battery and set VCORE statically to 1.36250V

Then I started a new, fresh poudriere run.

And guess what, after 1733 built ports (1 failed - "ghc"), my system panicked:
------------------------------------------------------------------------------
root@asbach:/var/crash/#kgdb -c vmcore.0 /usr/lib/debug/boot/kernel/kernel.debug 
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
spin lock 0xffffffff81dc8b50 (smp rendezvous) held by 0xfffff801325ea560 (tid 102081) too long
timeout stopping cpus
panic: spin lock held too long
cpuid = 6
KDB: stack backtrace:
#0 0xffffffff80aada97 at kdb_backtrace+0x67
#1 0xffffffff80a6bb76 at vpanic+0x186
#2 0xffffffff80a6b9e3 at panic+0x43
#3 0xffffffff80a4cf71 at _mtx_lock_spin_cookie+0x311
#4 0xffffffff81042dc1 at smp_targeted_tlb_shootdown+0x101
#5 0xffffffff81042cac at smp_masked_invltlb+0x4c
#6 0xffffffff80eced91 at pmap_invalidate_all+0x211
#7 0xffffffff80ed936a at pmap_advise+0x49a
#8 0xffffffff80d60c26 at vm_map_madvise+0x2c6
#9 0xffffffff80d6534e at sys_madvise+0x7e
#10 0xffffffff80ee0394 at amd64_syscall+0x6c4
#11 0xffffffff80ec392b at Xfast_syscall+0xfb
Uptime: 4h4m31s
Dumping 5426 out of 32665 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

Reading symbols from /usr/lib/debug/boot/kernel/zfs.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/zfs.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/opensolaris.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/opensolaris.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/linprocfs.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/linprocfs.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/linux_common.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/linux_common.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/tmpfs.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/tmpfs.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/vmm.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/vmm.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/ums.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/ums.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/pflog.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/pflog.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/pf.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/pf.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/linux.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/linux.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/linux64.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/linux64.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/nullfs.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/nullfs.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/fdescfs.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/fdescfs.ko.debug
#0  doadump (textdump=<value optimized out>) at pcpu.h:222
222     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) bt
#0  doadump (textdump=<value optimized out>) at pcpu.h:222
#1  0xffffffff80a6b6f1 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:366
#2  0xffffffff80a6bbb0 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
#3  0xffffffff80a6b9e3 in panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:690
#4  0xffffffff80a4cf71 in _mtx_lock_spin_cookie (c=<value optimized out>, v=<value optimized out>, tid=18446735289348100096, opts=<value optimized out>, 
    file=<value optimized out>, line=<value optimized out>) at /usr/src/sys/kern/kern_mutex.c:672
#5  0xffffffff81042dc1 in smp_targeted_tlb_shootdown (mask={__bits = 0xfffffe085f03b780}, vector=244, pmap=<value optimized out>, addr1=<value optimized out>, addr2=0)
    at /usr/src/sys/x86/x86/mp_x86.c:1470
#6  0xffffffff81042cac in smp_masked_invltlb (mask={__bits = 0xfffffe085f03b7b0}, pmap=<value optimized out>) at /usr/src/sys/x86/x86/mp_x86.c:1504
#7  0xffffffff80eced91 in pmap_invalidate_all (pmap=0xfffff8017f9ff138) at /usr/src/sys/amd64/amd64/pmap.c:1662
#8  0xffffffff80ed936a in pmap_advise (pmap=<value optimized out>, sva=35436597248, eva=35436597248, advice=5) at /usr/src/sys/amd64/amd64/pmap.c:6189
#9  0xffffffff80d60c26 in vm_map_madvise (map=<value optimized out>, start=35436552192, end=35436597248, behav=<value optimized out>) at /usr/src/sys/vm/vm_map.c:2291
#10 0xffffffff80d6534e in sys_madvise (td=<value optimized out>, uap=<value optimized out>) at /usr/src/sys/vm/vm_mmap.c:705
#11 0xffffffff80ee0394 in amd64_syscall (td=0xfffff802bb419000, traced=0) at subr_syscall.c:135
#12 0xffffffff80ec392b in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:396
#13 0x00000008020502fa in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language:  auto; currently minimal
------------------------------------------------------------------------------

I raised the voltage by 0.05V to 1.41250V as suggested by AMD tech support. And will try another fresh poudriere run now.

At least, that panic is something new - is it caused by a flaky CPU or by a software bug?
Comment 183 Nils Beyer 2017-07-28 18:09:19 UTC
And it panicked again - same reason as before:
---------------------------------------------------------------------------
root@asbach:/var/crash/#kgdb -c vmcore.1 /usr/lib/debug/boot/kernel/kernel.debug 
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
spin lock 0xffffffff81dc8b50 (smp rendezvous) held by 0xfffff8045bb54000 (tid 101852) too long
timeout stopping cpus
panic: spin lock held too long
cpuid = 2
KDB: stack backtrace:
#0 0xffffffff80aada97 at kdb_backtrace+0x67
#1 0xffffffff80a6bb76 at vpanic+0x186
#2 0xffffffff80a6b9e3 at panic+0x43
#3 0xffffffff80a4cf71 at _mtx_lock_spin_cookie+0x311
#4 0xffffffff81042dc1 at smp_targeted_tlb_shootdown+0x101
#5 0xffffffff81042cac at smp_masked_invltlb+0x4c
#6 0xffffffff80eced91 at pmap_invalidate_all+0x211
#7 0xffffffff80ed2f00 at pmap_protect+0x740
#8 0xffffffff80d608ba at vm_map_protect+0x3fa
#9 0xffffffff80d651fd at sys_mprotect+0x4d
#10 0xffffffff80ee0394 at amd64_syscall+0x6c4
#11 0xffffffff80ec392b at Xfast_syscall+0xfb
Uptime: 6h51m38s
Dumping 5333 out of 32665 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

Reading symbols from /usr/lib/debug/boot/kernel/zfs.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/zfs.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/opensolaris.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/opensolaris.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/linprocfs.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/linprocfs.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/linux_common.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/linux_common.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/tmpfs.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/tmpfs.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/vmm.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/vmm.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/ums.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/ums.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/pflog.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/pflog.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/pf.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/pf.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/linux.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/linux.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/linux64.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/linux64.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/nullfs.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/nullfs.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/fdescfs.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/fdescfs.ko.debug
#0  doadump (textdump=<value optimized out>) at pcpu.h:222
222     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) bt
#0  doadump (textdump=<value optimized out>) at pcpu.h:222
#1  0xffffffff80a6b6f1 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:366
#2  0xffffffff80a6bbb0 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
#3  0xffffffff80a6b9e3 in panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:690
#4  0xffffffff80a4cf71 in _mtx_lock_spin_cookie (c=<value optimized out>, v=<value optimized out>, tid=18446735290631902560, opts=<value optimized out>, file=<value optimized out>, 
    line=<value optimized out>) at /usr/src/sys/kern/kern_mutex.c:672
#5  0xffffffff81042dc1 in smp_targeted_tlb_shootdown (mask={__bits = 0xfffffe085e6ea740}, vector=244, pmap=<value optimized out>, addr1=<value optimized out>, addr2=0)
    at /usr/src/sys/x86/x86/mp_x86.c:1470
#6  0xffffffff81042cac in smp_masked_invltlb (mask={__bits = 0xfffffe085e6ea770}, pmap=<value optimized out>) at /usr/src/sys/x86/x86/mp_x86.c:1504
#7  0xffffffff80eced91 in pmap_invalidate_all (pmap=0xfffff8031ac7f138) at /usr/src/sys/amd64/amd64/pmap.c:1662
#8  0xffffffff80ed2f00 in pmap_protect (pmap=<value optimized out>, sva=<value optimized out>, eva=<value optimized out>, prot=<value optimized out>) at /usr/src/sys/amd64/amd64/pmap.c:4156
#9  0xffffffff80d608ba in vm_map_protect (map=0xfffff8031ac7f000, start=<value optimized out>, end=37457926553600, new_prot=3 '\003', set_max=0) at /usr/src/sys/vm/vm_map.c:2129
#10 0xffffffff80d651fd in sys_mprotect (td=<value optimized out>, uap=<value optimized out>) at /usr/src/sys/vm/vm_mmap.c:603
#11 0xffffffff80ee0394 in amd64_syscall (td=0xfffff80307c6d560, traced=0) at subr_syscall.c:135
#12 0xffffffff80ec392b in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:396
#13 0x00000008020e89ca in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language:  auto; currently minimal
---------------------------------------------------------------------------

because I'm not in our company's building right now (weekend), I can test the last voltage increase only on Monday...
Comment 184 Mark Millard 2017-07-28 19:15:47 UTC
(In reply to Nils Beyer from comment #183)

For:

#3  0xffffffff80a6b9e3 in panic (fmt=<value optimized out>) at
/usr/src/sys/kern/kern_shutdown.c:690
#4  0xffffffff80a4cf71 in _mtx_lock_spin_cookie (c=<value optimized out>,
v=<value optimized out>, tid=18446735290631902560, opts=<value optimized out>,
file=<value optimized out>, 
   line=<value optimized out>) at /usr/src/sys/kern/kern_mutex.c:672

I looked up the panic call at:

/usr/src/sys/kern/kern_mutex.c:672

It is the one in:

static void
_mtx_lock_spin_failed(struct mtx *m)
{
        struct thread *td;

        td = mtx_owner(m);

        /* If the mutex is unlocked, try again. */
        if (td == NULL)
                return;

        printf( "spin lock %p (%s) held by %p (tid %d) too long\n",
            m, m->lock_object.lo_name, td, td->td_tid);
#ifdef WITNESS
        witness_display_spinlock(&m->lock_object, td, printf);
#endif
        panic("spin lock held too long");
}


So the duration of holding the lock is involved in
hitting this specific panic.


In _mtx_lock_spin_cookie there is:

        for (;;) {
                if (v == MTX_UNOWNED) {
                        if (_mtx_obtain_lock_fetch(m, &v, tid))
                                break;
                        continue;
                }
                /* Give interrupts a chance while we spin. */
                spinlock_exit();
                do {
                        if (lda.spin_cnt < 10000000) {
                                lock_delay(&lda);
                        } else {
                                lda.spin_cnt++;
                                if (lda.spin_cnt < 60000000 || kdb_active ||
                                    panicstr != NULL)
                                        DELAY(1);
                                else
                                        _mtx_lock_spin_failed(m);
                                cpu_spinwait();
                        }
                        v = MTX_READ_VALUE(m);
                } while (v != MTX_UNOWNED);
                spinlock_enter();
        }

So apparently lda.spin_cnt made it to 60000000
or beyond.
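
As a rough idea of what that count means in wall-clock time, the sketch below models only the else-branch of the loop quoted above. It assumes the backoff phase hands over at exactly spin_cnt == 10000000, that kdb_active and panicstr stay unset, and that each DELAY(1) waits at least one microsecond; the function names are mine, not the kernel's:

```python
BACKOFF_LIMIT = 10_000_000   # below this, the loop calls lock_delay()
PANIC_LIMIT = 60_000_000     # at this count, _mtx_lock_spin_failed() is called

def delay_calls_simulated(backoff_limit, panic_limit):
    """Walk the else-branch step by step, counting DELAY(1) calls."""
    spin_cnt = backoff_limit          # assumed value on first entry to the else-branch
    calls = 0
    while True:
        spin_cnt += 1                 # lda.spin_cnt++
        if spin_cnt < panic_limit:
            calls += 1                # DELAY(1)
        else:
            return calls              # _mtx_lock_spin_failed(m)

def delay_calls_closed_form(backoff_limit, panic_limit):
    """Same count without looping; cheap enough for the real thresholds."""
    return max(0, panic_limit - backoff_limit - 1)

calls = delay_calls_closed_form(BACKOFF_LIMIT, PANIC_LIMIT)
print(calls, "DELAY(1) calls, so at least about", calls * 1e-6, "seconds of waiting")
```

Under those assumptions the waiter spent on the order of 50 seconds in DELAY(1) alone before giving up, on top of the earlier backoff phase.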
Comment 185 Don Lewis freebsd_committer freebsd_triage 2017-07-30 23:02:47 UTC
(In reply to Nils Beyer from comment #177)

The patch is currently being reviewed here:
  https://reviews.freebsd.org/D11780
Comment 186 Nils Beyer 2017-07-31 06:44:46 UTC
(In reply to Nils Beyer from comment #183)

after the last and final voltage increase VCORE -> 1.4250V, the system panicked rather quickly after eight minutes; again because of "spin lock held too long"; I'll spare you the backtrace as it looks exactly like the others before:
--------------------------------------------------------------------------------
root@asbach:/var/crash/#cat info.last
Dump header from device: /dev/ada0p1
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 5584277504
  Blocksize: 512
  Dumptime: Mon Jul 31 08:11:51 2017
  Hostname: asbach.renzel.net
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 11.1-RELEASE #2 r321399M: Mon Jul 24 18:31:19 CEST 2017
    root@asbach.renzel.net:/usr/obj/usr/src/sys/GENERIC
  Panic String: spin lock held too long
  Dump Parity: 632592130
  Bounds: 2
  Dump Status: good
--------------------------------------------------------------------------------

Now I've reset my CMOS to BIOS defaults and try a poudriere run with stock settings again - just as a cross-check...
Comment 187 Nils Beyer 2017-07-31 06:45:52 UTC
(In reply to Mark Millard from comment #184)

is it caused by a software bug or hardware misbehaviour?
Comment 188 Nils Beyer 2017-07-31 06:46:14 UTC
(In reply to Don Lewis from comment #185)

thanks - will try after my cross-test is finished...
Comment 189 Mark Millard 2017-07-31 07:14:38 UTC
(In reply to Nils Beyer from comment #187)

I can not even tell if it is some form of
deadlock vs. livelock or what all else is
holding the lock over time (or even at the
failure).

Someone that knows what they are doing for
use of dump files for amd64 might be able
to learn what was holding the lock at
the time of the failure. That might start
to give a somewhat better clue about what
was going on.
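
One small thing that can be read from the posted backtraces without the dump: the decimal "tid=" argument in frame #4 is a pointer printed as an unsigned integer. The sketch below converts the values from comments #182 and #183 to hex so they can be compared with the thread pointers shown elsewhere in the same output. The interpretation that this value identifies the waiting thread rather than the lock holder is my assumption, based on it matching the td= value in frame #11; the helper name is made up:

```python
def as_pointer(value):
    """Format an unsigned 64-bit integer the way kgdb prints pointers."""
    return "0x{:016x}".format(value & (2**64 - 1))

# (frame #4 tid argument, frame #11 td pointer, "held by" pointer), copied from the comments.
panics = {
    "comment #182": (18446735289348100096, "0xfffff802bb419000", "0xfffff801325ea560"),
    "comment #183": (18446735290631902560, "0xfffff80307c6d560", "0xfffff8045bb54000"),
}

for name, (tid_arg, syscall_td, holder) in panics.items():
    waiter = as_pointer(tid_arg)
    print(name, "waiter:", waiter,
          "same as syscall thread:", waiter == syscall_td,
          "same as holder:", waiter == holder)
```

So in both dumps the thread to inspect is the one named in the "held by" line (tid 102081 and tid 101852), not the thread whose backtrace was printed.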
Comment 190 Ivan Rozhuk 2017-07-31 13:20:34 UTC
    "On AMD, the tool discovered that some processors generate
    a #UD (undefined opcode) exception prior to completing the
    instruction fetch. Per AMD specifications, a #PF (page fault)
    exception occurring during an instruction fetch should
    supersede a #UD exception, but in the instruction search,
    which places the last bytes of the instruction on a
    non-executable page, some processors generate the #UD before the
    final bytes are moved off of the read/write page. It appears
    that AMD discovered this at around the same time as this
    research; the newest AMD Architecture Programmer’s
    Manual (March 2017) was updated to allow this situation."
From: https://github.com/xoreaxeaxeax/sandsifter/blob/master/references/domas_breaking_the_x86_isa_wp.pdf
Tool: https://github.com/xoreaxeaxeax/sandsifter
Comment 191 Nils Beyer 2017-07-31 17:07:46 UTC
(In reply to rozhuk.im from comment #190)

very interesting finding; AMD updated its dev docs in March 2017, right after the Ryzen launch - sounds like the OS has to perform some workarounds if the Zen microarchitecture is also affected.

Unfortunately, that tool you linked here doesn't compile under FreeBSD...
Comment 192 Ivan Rozhuk 2017-08-01 00:48:31 UTC
(In reply to Nils Beyer from comment #191)

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221132
You can play with it.
I will finish it later.
Comment 193 Don Lewis freebsd_committer freebsd_triage 2017-08-01 02:01:07 UTC
(In reply to rozhuk.im from comment #190)
I don't see that change in the March 2017 AMD64 Architecture Programmer’s Manual Volume 2: System Programming (revision 3.28), which I found here:
http://support.amd.com/TechDocs/24593.pdf

The only change to the #UD description is the addition of the UD0 and UD1 instructions as potential causes.  I didn't find these in the AMD documentation, but apparently they are reserved opcodes that will generate #UD.

It looks to me like encountering this bug will kill the process with SIGILL, which we aren't seeing.  Working around this looks like it would be ugly ...
Comment 194 Nils Beyer 2017-08-01 07:48:09 UTC
(In reply to Nils Beyer from comment #186)

> Now I've reset my CMOS to BIOS defaults and try a poudriere run with stock settings again - just as a cross-check...

and it kernel panicked whereas it ran two complete poudriere builds with 27000 ports each before all this. *sigh*

Don, I'll try your D11780 Phabricator patch and run a poudriere build again.

BTW: AMD tech support offers me an RMA after I told them my results of the voltage setting experiment - so they also do not seem to have any ideas what's going on...
Comment 195 commit-hook freebsd_committer freebsd_triage 2017-08-02 01:44:34 UTC
A commit references this bug:

Author: truckman
Date: Wed Aug  2 01:43:36 UTC 2017
New revision: 321899
URL: https://svnweb.freebsd.org/changeset/base/321899

Log:
  Lower the amd64 shared page, which contains the signal trampoline,
  from the top of user memory to one page lower on machines with the
  Ryzen (AMD Family 17h) CPU.  This pushes ps_strings and the stack
  down by one page as well.  On Ryzen there is some sort of interaction
  between code running at the top of user memory address space and
  interrupts that can cause FreeBSD to either hang or silently reset.
  This sounds similar to the problem found with DragonFly BSD that
  was fixed with this commit:
    https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20
  but our signal trampoline location was already lower than the address
  that DragonFly moved their signal trampoline to.  It also does not
  appear to be related to SMT as described here:
    https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads?p=955498#post955498

    "Hi, Matt Dillon here. Yes, I did find what I believe to be a
     hardware issue with Ryzen related to concurrent operations. In a
     nutshell, for any given hyperthread pair, if one hyperthread is
     in a cpu-bound loop of any kind (can be in user mode), and the
     other hyperthread is returning from an interrupt via IRETQ, the
     hyperthread issuing the IRETQ can stall indefinitely until the
     other hyperthread with the cpu-bound loop pauses (aka HLT until
     next interrupt). After this situation occurs, the system appears
     to destabilize. The situation does not occur if the cpu-bound
     loop is on a different core than the core doing the IRETQ. The
     %rip the IRETQ returns to (e.g. userland %rip address) matters a
     *LOT*. The problem occurs more often with high %rip addresses
     such as near the top of the user stack, which is where DragonFly's
     signal trampoline traditionally resides. So a user program taking
     a signal on one thread while another thread is cpu-bound can cause
     this behavior. Changing the location of the signal trampoline
     makes it more difficult to reproduce the problem. I have not
     been able to completely mitigate it. When a cpu-thread
     stalls in this manner it appears to stall INSIDE the microcode
     for IRETQ. It doesn't make it to the return pc, and the cpu thread
     cannot take any IPIs or other hardware interrupts while in this
     state."
  since the system instability has been observed on FreeBSD with SMT
  disabled.  Interrupts do appear to play a factor since running a
  signal-intensive process on the first CPU core, which handles most
  of the interrupts on my machine, is far more likely to trigger the
  problem than running such a process on any other core.

  Also lower sv_maxuser to prevent a malicious user from using mmap()
  to load and execute code in the top page of user memory that was made
  available when the shared page was moved down.

  Make the same changes to the 64-bit Linux emulator.

  PR:		219399
  Reported by:	nbe@renzel.net
  Reviewed by:	kib
  Reviewed by:	dchagin (previous version)
  Tested by:	nbe@renzel.net (earlier version)
  MFC after:	2 weeks
  Differential Revision:	https://reviews.freebsd.org/D11780

Changes:
  head/sys/amd64/amd64/elf_machdep.c
  head/sys/amd64/amd64/initcpu.c
  head/sys/amd64/include/md_var.h
  head/sys/amd64/linux/linux_sysvec.c
Comment 196 Nils Beyer 2017-08-02 12:27:45 UTC
(In reply to commit-hook from comment #195)

thanks for the commit; I can confirm that there are no system freezes/spontaneous reboots (without kernel panics) any more. Normal kernel panics (with crash dumps if available) still appear though...
Comment 197 Nils Beyer 2017-08-02 12:31:11 UTC
As a last act of desperation, I've OCed my system:

    - CPU freq: 3000MHz -> 3700MHz
    - VCORE: AUTO -> 1.3625V
    - DDR4 freq: 2133MHz -> 2400MHz

and disabled "OPCache Control" in AMD CBS section of my BIOS.

Let's see how it goes...
Comment 198 Don Lewis freebsd_committer freebsd_triage 2017-08-02 16:42:49 UTC
(In reply to rozhuk.im from comment #190)
The slide deck here:
  https://github.com/xoreaxeaxeax/sandsifter/blob/master/references/domas_breaking_the_x86_isa.pdf
is pretty informative.  It turns out that this problem affects the Geode.  The difference in behavior is mentioned in Table 8-8 of the document that I previously cited.

I think what is happening is that in the case of invalid instructions, the hardware still does a preliminary determination of their length to determine how many bytes to fetch.  If a page fault happens while fetching the remaining bytes, then a page fault exception is supposed to happen, but in this case, the hardware has already decided that the instruction is invalid and raises an undefined instruction exception instead.

It looks to me like the only real damage is that this breaks the algorithm that sandsifter uses to determine instruction lengths.  It doesn't look like it causes valid instructions to be flagged as invalid if they can't be fetched without causing a page fault.
Comment 199 Nils Beyer 2017-08-02 19:44:59 UTC
(In reply to Nils Beyer from comment #197)

it's still running; as a side note: I cannot tell whether the overclocking is really noticeable regarding compilation performance - I think I'll have to wait until the complete run is finished. At least, I can notice in the sensor readings that something has changed:
-------------------------------------------------------------------------
================================================================
MBTEMP:   40
CPUTEMP:  66
AUXTEMP0: 14
AUXTEMP1: 29
AUXTEMP2: 23
AUXTEMP3: 231
----------------------------------------------------------------
VCORE:    .656000
+3.3V:    3.328000
VSOC:     .936000
----------------------------------------------------------------
SYSFAN:   11529
CPUFAN:   4356
AUXFAN0:  65311
AUXFAN1:  65311
AUXFAN2:  65311
================================================================
-------------------------------------------------------------------------
Comment 200 Nils Beyer 2017-08-02 19:52:30 UTC
(In reply to Don Lewis from comment #198)

so, we're on the wrong track with "sandsifter" in order to find out what's buggy in the CPU and how to possibly circumvent it, correct?
Comment 201 Nils Beyer 2017-08-02 19:59:06 UTC
(In reply to Nils Beyer from comment #200)

maybe the artificial intelligence inside the Ryzen is freaking out... I don't know:

    http://www.bbspot.com/Images/Comics/2008/2008-02-06_pcw.jpg
Comment 202 Don Lewis freebsd_committer freebsd_triage 2017-08-03 04:46:20 UTC
(In reply to Nils Beyer from comment #200)
I believe so.  It's pretty unlikely that the problem is caused by undefined opcodes, and we are not seeing any evidence (SIGILL) of valid instructions being trapped as invalid because they experience page faults mid-fetch.

BTW, using either my original workaround patch, or the committed version if the sv_maxuser adjustment is commented out, it is possible to use a user process to mmap() the top page of user memory, load some code up there, and execute it for testing purposes.  I've done some experiments with that and it is possible to quickly hang the machine or cause it to reboot.  The interesting thing is that I haven't observed any ill effects as long as no instructions are executed above 0x7fffffffff40.  That's sort of in the area mentioned in the DragonFly fix, but even they saw issues at addresses lower than that and a decreasing rate as the address was lowered.  Our signal trampoline code was much closer to the bottom of the page at 0x7ffffffff000, so at this point I don't know why we were having problems.  The only thing that I can think of is that the signal trampoline code uses some unusual instructions like syscall and hlt, which are unlike the more vanilla instructions that I was using in my experiments.
Comment 203 Nils Beyer 2017-08-03 17:19:51 UTC
(In reply to Don Lewis from comment #202)

> BTW, using either my original workaround patch, or the committed version if the sv_maxuser adjustment is commented out, it is possible to use a user process to mmap() the top page of user memory,
> load some code up there, and execute it for testing purposes.  I've done some experiments with that and it is possible to quickly hang the machine or cause it to reboot.

that sounds interesting; do you have source code for that available that I can try here on my systems, please?
Comment 204 Don Lewis freebsd_committer freebsd_triage 2017-08-04 16:19:32 UTC
Created attachment 185022 [details]
program to cause Ryzen hang/reboot on tweaked FreeBSD by executing code in high memory

If you modify the FreeBSD kernel to lower the shared page, but leave sv_maxuser at its original value so that a user program can mmap the page at 0x7ffffffff000, the  attached program will fill that page with RET instructions and perform calls to those in a loop.  I have not observed any issues with calls to the RET instructions at 0x7fffffffff3f or below.  When the RET instruction at 0x7fffffffff40 is executed, my machine will typically silently reboot without a panic message, or it will sometimes hang with the screen blanked.  This test is the most sensitive on core 0, which handles most interrupts, so run under "cpuset -l 0" for best results.  The problem appears to be triggered when the RET instruction is interrupted.

It is possible to load and execute arbitrary code for other experiments.
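For readers without access to the attachment, the technique described in this comment can be sketched roughly as follows. This is not the attached program, only a minimal illustration under stated assumptions: the function name `exec_ret_page` and its parameters are made up here, the page is mapped writable first and then switched to executable, and the byte 0xc3 is the x86 near RET. Passing NULL lets the kernel pick a harmless address; passing 0x7ffffffff000 only works on a kernel tweaked as described above, and MAP_FIXED replaces whatever was mapped at that address.

```c
#define _DEFAULT_SOURCE /* for MAP_ANON under strict -std= on glibc */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map one page (at `where`, or anywhere if NULL),
 * fill it with RET instructions and call every byte offset in it
 * `calls_per_addr` times.  Returns the number of calls made, or -1 if
 * the page could not be set up or the CPU is not x86.
 */
long exec_ret_page(void *where, int calls_per_addr, int verbose)
{
#if defined(__x86_64__) || defined(__i386__)
    long page = sysconf(_SC_PAGESIZE);
    int flags = MAP_PRIVATE | MAP_ANON | (where != NULL ? MAP_FIXED : 0);
    unsigned char *p = mmap(where, (size_t)page, PROT_READ | PROT_WRITE,
                            flags, -1, 0);
    long calls = 0;

    if (p == MAP_FAILED)
        return -1;
    memset(p, 0xc3, (size_t)page);              /* 0xc3 = near RET */
    if (mprotect(p, (size_t)page, PROT_READ | PROT_EXEC) != 0) {
        munmap(p, (size_t)page);
        return -1;
    }
    for (long off = 0; off < page; off++) {
        void (*fn)(void) = (void (*)(void))(uintptr_t)(p + off);

        if (verbose) {
            printf("executing at %p ", (void *)(p + off));
            fflush(stdout);
        }
        for (int i = 0; i < calls_per_addr; i++) {
            fn();                               /* CALL, then immediate RET */
            calls++;
            if (verbose) {
                putchar('.');
                fflush(stdout);                 /* so progress survives a hang */
            }
        }
        if (verbose)
            putchar('\n');
    }
    munmap(p, (size_t)page);
    return calls;
#else
    (void)where; (void)calls_per_addr; (void)verbose;
    return -1;
#endif
}
```

On a tweaked kernel one would call something like `exec_ret_page((void *)0x7ffffffff000, 10, 1)` from a small main() and run the binary under "cpuset -l 0"; flushing after every dot is what lets the last line show how far it got before the machine hung.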
Comment 205 Nils Beyer 2017-08-05 01:21:05 UTC
(In reply to Don Lewis from comment #204)

absolutely wonderful test case you've created - despite your instructions I've run that program on a vanilla 11.1-RELEASE (no patches) on my second Ryzen system (the one with 16GB where I've never poudriered), and it freezes exactly there where you have experienced it (done it via SSH, so output is still visible):
-------------------------------------------------------------------------------
[...]
executing at 0x7fffffffff30 ..........
executing at 0x7fffffffff31 ..........
executing at 0x7fffffffff32 ..........
executing at 0x7fffffffff33 ..........
executing at 0x7fffffffff34 ..........
executing at 0x7fffffffff35 ..........
executing at 0x7fffffffff36 ..........
executing at 0x7fffffffff37 ..........
executing at 0x7fffffffff38 ..........
executing at 0x7fffffffff39 ..........
executing at 0x7fffffffff3a ..........
executing at 0x7fffffffff3b ..........
executing at 0x7fffffffff3c ..........
executing at 0x7fffffffff3d ..........
executing at 0x7fffffffff3e ..........
executing at 0x7fffffffff3f ..........
executing at 0x7fffffffff40 ...
-------------------------------------------------------------------------------

tried that on my Intel Xeon E3-1220 v3 system, but had to use the second "WHERE"-define because it couldn't mmap at 0x7f(...) - ENOMEM. Anyways, the result is that it didn't freeze there...
Comment 206 Nils Beyer 2017-08-05 01:54:46 UTC
(In reply to Nils Beyer from comment #205)

sorry, mass confusion in my brain; I accidentally executed that program on my first Ryzen system with 32GB of RAM and your original patch applied. The SSH session to my second Ryzen system timed out, and because I do SSH there via SSH to my first Ryzen system I was therefore thrown back to my first system which I didn't notice. :-(

Sorry for the confusion, but the result stays the same. Freeze at the suspicious location.

At the moment, I'm trying to get the same freeze on my second Ryzen system (using another SSH hop) with 16GB of RAM. There, I had to use the second "WHERE"-define as well. And there's vanilla 11.1-RELEASE indeed running, so I assume that this freeze won't happen because I need your original patch, right?
Comment 207 Nils Beyer 2017-08-05 02:14:14 UTC
(In reply to Nils Beyer from comment #206)

answering myself: yes, maybe I should just listen to what Don said: I do need that original patch because without it the program is not able to mmap at 0x7f(...) - as Don instructed. And at 0x4f(...) there seems to be no problem.

Using "cpuset -l 0 ./ryzen_provoke_crash" it freezes exactly where my first system did:
--------------------------------------------------------------------------------
[...]
executing at 0x7fffffffff3a ..........
executing at 0x7fffffffff3b ..........
executing at 0x7fffffffff3c ..........
executing at 0x7fffffffff3d ..........
executing at 0x7fffffffff3e ..........
executing at 0x7fffffffff3f ..........
executing at 0x7fffffffff40 ......
--------------------------------------------------------------------------------

Without using "cpuset -l 0" the program seems to pass the 0x7fffffffff40 mark, but becomes very, very slow - don't know if it freezes because I aborted it; took too long for one row...
Comment 208 Don Lewis freebsd_committer freebsd_triage 2017-08-05 06:46:02 UTC
The second WHERE 0x04fffffff000 is some leftover debug stuff from an earlier version of the program that executed some more complicated code.  I needed to debug that code in a harmless spot in memory so that I could get that code working right.

Even without cpuset, I think I eventually got it to crash at 0x7fffffffff40, probably because it migrated to CPU 0 on its own or an interrupt finally caught it at that address, which would be less frequent on the other cores.  There might have been other stuff running on the system at the same time.

If you pin it to some other CPU, do you see system time spike up when it gets to 0x7fffffffff40?  I wonder if it's getting kicked into a trap handler on every iteration when it gets to that address.  That and an interrupt happening at the same time might send it off into the weeds.

It would also be interesting to see the results on non-Ryzen hardware.
Comment 209 Nils Beyer 2017-08-05 12:32:47 UTC
(In reply to Don Lewis from comment #208)

> It would also be interesting to see the results on non-Ryzen hardware.

done that on the mentioned Intel Xeon system. 11.1-RELEASE plus the single-line patch in "vmparam.h":
-------------------------------------------------------------------------------
#define SHAREDPAGE (VM_MAXUSER_ADDRESS - 2*PAGE_SIZE)
-------------------------------------------------------------------------------

Your program runs through without any problems or slowdowns at/over "0x7fffffffff40" - with as well as without pinning it to core 0...
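The arithmetic behind that one-line change can be sketched as follows. The constants are assumptions, not copied from vmparam.h: 4 KiB pages and a user address space ending at 1 << 47, which is consistent with the 0x7ffffffff000 page address quoted earlier in this thread. The names are made up for illustration.

```c
#include <stdint.h>

/* Assumed amd64 values; the real definitions live in the kernel headers. */
#define PAGE_SIZE_ASSUMED   UINT64_C(4096)
#define MAXUSER_ASSUMED     (UINT64_C(1) << 47)

/* Base address of a page that sits `pages_below_top` pages under the limit. */
uint64_t shared_page_base(unsigned pages_below_top)
{
    return MAXUSER_ASSUMED - (uint64_t)pages_below_top * PAGE_SIZE_ASSUMED;
}
```

With these assumptions, `shared_page_base(1)` is the original top page at 0x7ffffffff000 and `shared_page_base(2)` is the lowered location at 0x7fffffffe000.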
Comment 210 Nils Beyer 2017-08-06 22:05:11 UTC
(In reply to Don Lewis from comment #204)

Don, is it possible to modify your program so that it triggers that freeze on a vanilla 11.1-RELEASE?
Comment 211 Don Lewis freebsd_committer freebsd_triage 2017-08-06 23:18:32 UTC
(In reply to Nils Beyer from comment #210)
No.  The kernel needs to be modified to move the shared page out of the way so that the program can mmap() the top page of user memory.

With an unmodified kernel, it should be possible to trigger the problem by writing a program that installs a signal handler for some signal and then sending itself signals by calling kill() in a loop.
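A minimal sketch of such a signal-driven program, assuming SIGUSR1 and a bounded loop so that it terminates on healthy hardware; the function name `signal_storm` is made up here, and whether this actually provokes the hang on an affected machine is untested:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t handled;

/* Trivial handler; entering and leaving it goes through the signal trampoline. */
static void on_signal(int signo)
{
    (void)signo;
    handled++;
}

/*
 * Send SIGUSR1 to ourselves `count` times.
 * Returns how often the handler ran, or -1 on error.
 */
long signal_storm(long count)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    handled = 0;
    for (long i = 0; i < count; i++) {
        if (kill(getpid(), SIGUSR1) != 0)
            return -1;
    }
    return (long)handled;
}
```

Calling `signal_storm()` with a large count from main() and running the result under "cpuset -l 0" would mirror the setup used for the other test program.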
Comment 212 Nils Beyer 2017-08-07 10:41:51 UTC
(In reply to Don Lewis from comment #204)

following claim came from a Linux user regarding freezes near the top memory limit:
-------------------------------------------------------------------------------
That's looking more BSD-specific, as Linux does have a full guard page as opposed to a partial one in BSD. Does this mitigate a CPU issue? Probably, but it isn't a new one, and the CPU/RAM controller shouldn't be susceptible. But this side of things may have already been sorted in Linux land.
-------------------------------------------------------------------------------

is there something to it?
Comment 213 Nils Beyer 2017-08-07 17:28:59 UTC
(In reply to Nils Beyer from comment #212)

more statements:
--------------------------------------------------------------------------------
I'm new to this myself (I work on the GPU SW side) but AFAICS there are at least three different CPU families (1 from AMD) over the last decade which require special treatment, basically making sure that no code gets executed near the end of canonical user address space. The top of user process address space is the dividing line between the least privileged code and the touch-it-and-die non-canonical address space.

Over time it seems that more "safe area" is required - presumably because each new CPU generation pre-fetches further ahead than the last one. In a sense Linux (and Windows I believe) got lucky by reserving a full guard page while BSD allocated a smaller guard area. As a result BSD has had to bump the guard area (to a full page) while other OSes did not.
--------------------------------------------------------------------------------
Comment 214 Don Lewis freebsd_committer freebsd_triage 2017-08-07 22:11:33 UTC
I'm not familiar enough with the Linux memory map to comment, but I haven't seen any reports of hang/reboot problems that sound like what we ran into.  There have been reports of idle machines having problems that seem to be resolved by avoiding the deep Cx states.  I don't know if that is a potential problem for us since our default seems to be:
  hw.acpi.cpu.cx_lowest: C1

It would be nice if the "safe area" was documented.  This doesn't appear to be an issue with Intel CPUs.  My AMD FX-8320E is happy with the pre-Ryzen location of our signal trampoline code.  I haven't run my test program on it because I'm keeping it busy doing the port builds that I'd hoped to be doing on my new Ryzen box.
Comment 215 Nils Beyer 2017-08-07 22:26:11 UTC
(In reply to Don Lewis from comment #214)

documentation is the keyword, I don't know; I only have this:
----------------------------------------------------------------------------
from: linux/v4.12/source/arch/x86/include/asm/processor.h - lines 791 et seq.
----------------------------------------------------------------------------
#ifdef CONFIG_X86_32
[...]
#else
/*
 * User space process size. 47bits minus one guard page.  The guard
 * page is necessary on Intel CPUs: if a SYSCALL instruction is at
 * the highest possible canonical userspace address, then that
 * syscall will enter the kernel with a non-canonical return
 * address, and SYSRET will explode dangerously.  We avoid this
 * particular problem by preventing anything from being mapped
 * at the maximum canonical address.
 */
#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
----------------------------------------------------------------------------

do we have or need something similar?
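The Linux comment quoted above hinges on what "non-canonical" means. A minimal sketch, assuming 48-bit virtual addresses (bits 63..47 must all be equal) and the 2-byte SYSCALL encoding; the function names are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* True if bits 63..47 of addr are all zeros or all ones. */
bool is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* the upper 17 bits */

    return top == 0 || top == 0x1ffff;
}

/* Address the kernel would return to after a SYSCALL located at syscall_addr. */
uint64_t syscall_return_addr(uint64_t syscall_addr)
{
    return syscall_addr + 2;            /* SYSCALL is encoded as 0F 05 */
}
```

A SYSCALL occupying the last two bytes of user space (0x7ffffffffffe) would return to 0x800000000000, which this check reports as non-canonical; keeping the top page unmapped means no instruction can sit there in the first place.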
Comment 216 Don Lewis freebsd_committer freebsd_triage 2017-08-07 23:23:43 UTC
LOL ... 

Prior to the fix in r321899, the top page of user memory for amd64 executables was used by the shared page, the contents of which are controlled by the kernel.  This page does contain the signal trampoline, which contains a SYSCALL instruction, which made me very suspicious based on my experiments with executing code in this page.  The SYSCALL instruction is located well away from the top of the page, though.  I may try playing with this instruction if I ever have the time.

After r321899, the shared page is moved lower and we don't allow the top page to be used at all, similar to Linux.  CloudABI64 got a similar fix.
Comment 217 Mark Millard 2017-08-08 00:21:25 UTC
(In reply to Nils Beyer from comment #215)

Looking around the comment is newer than the definition
but the comment was added on: 2014-Nov-4. (Based on
"blame".)

By contrast the line:

#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)

dates back to 2009-Feb-20.

Definitely not a new issue.

The placement in the files is in blame for:

https://github.com/torvalds/linux/blame/master/arch/x86/include/asm/processor.h
Comment 218 Konstantin Belousov freebsd_committer freebsd_triage 2017-08-08 08:10:20 UTC
(In reply to Don Lewis from comment #216)
This is a reference to FreeBSD-SA-12:04.sysret.  We handle that by forcing normal iret return path instead of the fast syscall return, same as we handle any modifications to the context from usermode.  See the very end of the amd64_syscall().
Comment 219 commit-hook freebsd_committer freebsd_triage 2017-08-16 08:01:03 UTC
A commit references this bug:

Author: truckman
Date: Wed Aug 16 07:59:58 UTC 2017
New revision: 322569
URL: https://svnweb.freebsd.org/changeset/base/322569

Log:
  MFC r321899

  Lower the amd64 shared page, which contains the signal trampoline,
  from the top of user memory to one page lower on machines with the
  Ryzen (AMD Family 17h) CPU.  This pushes ps_strings and the stack
  down by one page as well.  On Ryzen there is some sort of interaction
  between code running at the top of user memory address space and
  interrupts that can cause FreeBSD to either hang or silently reset.
  This sounds similar to the problem found with DragonFly BSD that
  was fixed with this commit:
    https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20
  but our signal trampoline location was already lower than the address
  that DragonFly moved their signal trampoline to.  It also does not
  appear to be related to SMT as described here:
    https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads?p=955498#post955498

    "Hi, Matt Dillon here. Yes, I did find what I believe to be a
     hardware issue with Ryzen related to concurrent operations. In a
     nutshell, for any given hyperthread pair, if one hyperthread is
     in a cpu-bound loop of any kind (can be in user mode), and the
     other hyperthread is returning from an interrupt via IRETQ, the
     hyperthread issuing the IRETQ can stall indefinitely until the
     other hyperthread with the cpu-bound loop pauses (aka HLT until
     next interrupt). After this situation occurs, the system appears
     to destabilize. The situation does not occur if the cpu-bound
     loop is on a different core than the core doing the IRETQ. The
     %rip the IRETQ returns to (e.g. userland %rip address) matters a
     *LOT*. The problem occurs more often with high %rip addresses
     such as near the top of the user stack, which is where DragonFly's
     signal trampoline traditionally resides. So a user program taking
     a signal on one thread while another thread is cpu-bound can cause
     this behavior. Changing the location of the signal trampoline
     makes it more difficult to reproduce the problem. I have not
     been able to completely mitigate it. When a cpu-thread
     stalls in this manner it appears to stall INSIDE the microcode
     for IRETQ. It doesn't make it to the return pc, and the cpu thread
     cannot take any IPIs or other hardware interrupts while in this
     state."
  since the system instability has been observed on FreeBSD with SMT
  disabled.  Interrupts do appear to play a factor since running a
  signal-intensive process on the first CPU core, which handles most
  of the interrupts on my machine, is far more likely to trigger the
  problem than running such a process on any other core.

  Also lower sv_maxuser to prevent a malicious user from using mmap()
  to load and execute code in the top page of user memory that was made
  available when the shared page was moved down.

  Make the same changes to the 64-bit Linux emulator.

  PR:		219399
  Reported by:	nbe@renzel.net
  Reviewed by:	kib
  Reviewed by:	dchagin (previous version)
  Tested by:	nbe@renzel.net (earlier version)
  Differential Revision:	https://reviews.freebsd.org/D11780

Changes:
_U  stable/11/
  stable/11/sys/amd64/amd64/elf_machdep.c
  stable/11/sys/amd64/amd64/initcpu.c
  stable/11/sys/amd64/include/md_var.h
  stable/11/sys/amd64/linux/linux_sysvec.c
Comment 220 Ivan Rozhuk 2017-08-23 14:08:24 UTC
There are some statistics suggesting that most problems happen with CPUs manufactured before week 26:
https://www.reddit.com/r/Amd/comments/6ubmd1/ryzen_compilation_segfaults_positive_rma/
https://docs.google.com/spreadsheets/d/1pp6SKqvERxBKJupIVTp2_FMNRYmtxgP14ZekMReQVM4/htmlview

Where: UA1725SUS: 17=2017 year, 25=week.
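For anyone collecting date codes, the decoding above can be sketched like this. It assumes the first four digits in the marking are always YYWW, which is only what the example suggests and not anything documented by AMD; the function name is made up.

```c
#include <ctype.h>

/*
 * Extract year and week from a marking such as "UA1725SUS" or "UA 1730SUS".
 * Returns 0 on success, -1 if no four-digit group follows the prefix.
 */
int parse_date_code(const char *code, int *year, int *week)
{
    const char *p = code;

    /* Skip the letter prefix (and any space) up to the first digit. */
    while (*p != '\0' && !isdigit((unsigned char)*p))
        p++;
    /* Require four consecutive digits; stops at the terminator if too short. */
    for (int i = 0; i < 4; i++) {
        if (!isdigit((unsigned char)p[i]))
            return -1;
    }
    *year = 2000 + (p[0] - '0') * 10 + (p[1] - '0');
    *week = (p[2] - '0') * 10 + (p[3] - '0');
    return 0;
}
```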
Comment 221 Ivan Rozhuk 2017-09-04 00:52:04 UTC
Off-topic: can someone with a Ryzen test amdtemp: https://reviews.freebsd.org/D9759 ?
Comment 222 Don Lewis freebsd_committer freebsd_triage 2017-09-04 01:23:18 UTC
(In reply to rozhuk.im from comment #221)
I have, works great.

It doesn't take into account the 20C offset on the 1x00X parts, so you'll have to manually set the offset or remember to do mental math.

After subtracting the offset, I can see that my 1700X is running just above 60C under load.
Comment 223 Nils Beyer 2017-09-04 07:58:21 UTC
Back from vacation and got my RMA CPU dated "UA 1730SUS" (end of July). Will run poudriere again...
Comment 224 Ivan Rozhuk 2017-09-04 11:49:01 UTC
I got my money back for my Ryzen and am now looking for one with a date code later than 1725 in local markets.

I tried an A10-9700 on the Taichi and got a reboot while uploading a 100 GB file with rsync, just like with the Ryzen.

A10-9700 on ASRock AB350M Pro4 and on ASRock Fatal1ty X370 Gaming X - OK, but once the AB350M Pro4 system froze (until I pressed reset) when Firefox (with many tabs) exited after the rsync upload test. I can't reproduce this.

mpv youtube + game - OK on both mobos. (Ryzen + Taichi = freeze/reboot)
Also, I got no strange Wine app crashes with this CPU on these mobos.

Now I'm not sure whether the problem was in my Ryzen.
Comment 225 Nils Beyer 2017-09-05 16:33:26 UTC
(In reply to Nils Beyer from comment #223)

well, it froze again - black screen, fans still running. But, because I cannot remember whether I had that user page shift patch still active or not, I've upgraded to 12-CURRENT and restarted the poudriere build from scratch.

The positive thing is that the new CPU is 5°C cooler than my old one - so that's something... ;-)
Comment 226 Nils Beyer 2017-09-05 16:37:39 UTC
(In reply to rozhuk.im from comment #224)

that's very strange. You're running AGESA 1.0.0.6b on the Fatal1ty, right? Do you run that on the Taichi, too? The AB350M Pro4 and the AB350 Pro4 still don't have AGESA 1.0.0.6b available, so maybe that makes a difference; I don't know.

There still is no official Check-My-Ryzen-Tool from AMD, so it's all speculation...
Comment 227 Ivan Rozhuk 2017-09-06 09:28:34 UTC
(In reply to Nils Beyer from comment #226)

Now I use Bristol Ridge; it is not affected by AGESA, IMHO.
Comment 228 Nils Beyer 2017-09-07 06:32:21 UTC
(In reply to Nils Beyer from comment #225)

Froze again - black screen, fans still running...
Comment 229 Don Lewis freebsd_committer freebsd_triage 2017-09-07 06:38:17 UTC
(In reply to Nils Beyer from comment #228)
That really sounds like the shared page problem ...

Do you know what was building at the time?
Comment 230 Don Lewis freebsd_committer freebsd_triage 2017-09-07 06:42:42 UTC
Created attachment 186143 [details]
program to print the address of the signal trampoline

Compile and run this program.  What is its output?
Comment 231 Nils Beyer 2017-09-07 06:45:41 UTC
(In reply to Don Lewis from comment #229)

can only guess:
-------------------------------------------------------------------------------
root@asbach:/usr/local/poudriere/data/logs/bulk/11_1-default/2017-09-05_16h59m32s/logs/#ls -larTt | tail -10
-rw-r--r--  3 root  wheel       7565 Sep  7 03:48:40 2017 wordplay-7.22_1.log
-rw-r--r--  3 root  wheel       6285 Sep  7 03:48:40 2017 powder-115_3.log
-rw-r--r--  3 root  wheel       6173 Sep  7 03:48:40 2017 amtterm-1.4.log
-rw-r--r--  3 root  wheel       7777 Sep  7 03:48:40 2017 rubygem-fpm-1.9.2.log
drwxr-xr-x  2 root  wheel       8055 Sep  7 03:48:40 2017 errors
-rw-r--r--  3 root  wheel       7754 Sep  7 03:48:41 2017 ru-stardict-mueller7-2.4.2.log
-rw-r--r--  3 root  wheel  361443781 Sep  7 03:48:41 2017 copperspice-1.3.2_3.log
drwxr-xr-x  3 root  wheel      25649 Sep  7 03:48:41 2017 .
-rw-r--r--  3 root  wheel         22 Sep  7 03:48:41 2017 xwit-3.4_3.log
-rw-r--r--  3 root  wheel       4821 Sep  7 03:48:41 2017 zen-cart-1.3.9h_2.log
-------------------------------------------------------------------------------
Comment 232 Nils Beyer 2017-09-07 06:46:42 UTC
(In reply to Don Lewis from comment #230)

--------------------------------------------------------------------------
root@asbach:/home/nbe/#./sigtramp
8
-1 12 0x7fffffffe190 8
--------------------------------------------------------------------------
Comment 233 Don Lewis freebsd_committer freebsd_triage 2017-09-07 06:59:01 UTC
(In reply to Nils Beyer from comment #232)
Yeah that should be ok.  I get this:
8
-1 12 0x7fffffffe000 8

As long as the third value ends with fexxx then you have the shared page
fix.   There is some strange reason that it took me a long time to
figure out that causes the least significant bits to change, maybe
whether linux emulation is loaded ...
Comment 234 Nils Beyer 2017-09-07 07:07:10 UTC
(In reply to Don Lewis from comment #233)

"linux" and "linux64" modules are indeed loaded:
--------------------------------------------------------------------------------
root@asbach:/root/#kldstat
Id Refs Address            Size     Name
 1   35 0xffffffff80200000 1f49f00  kernel
 2    1 0xffffffff8214b000 31ca00   zfs.ko
 3    2 0xffffffff82468000 cc20     opensolaris.ko
 4    1 0xffffffff82b11000 ac15     linprocfs.ko
 5    1 0xffffffff82b1c000 7bb1     linux_common.ko
 6    1 0xffffffff82b24000 ba42     tmpfs.ko
 7    1 0xffffffff82b30000 3653     ums.ko
 8    1 0xffffffff82b34000 2a22     pflog.ko
 9    1 0xffffffff82b37000 3551d    pf.ko
12    1 0xffffffff82bef000 66b8     nullfs.ko
13    1 0xffffffff82bf6000 5c6d     fdescfs.ko
14    1 0xffffffff82b6d000 1d7c     amdtemp.ko
15    1 0xffffffff82b6f000 ed0      amdsmn.ko
--------------------------------------------------------------------------------

So, should I try without them?

There's still no AGESA 1.0.0.6b for my board. Shall I wait until that?

Or should I contact AMD support again that the replacement doesn't work either?
Comment 235 Don Lewis freebsd_committer freebsd_triage 2017-09-07 07:15:33 UTC
You'll need the linux kmods for doing a full poudriere run I believe.  The lower bits really don't matter.  The important thing is staying below 0x7ffffffff000.

I think the only thing that AGESA 1.0.0.6b buys you is better RAM compatibility and speed.

Yeah, I'd get in touch with AMD.  I suspect that you'll have to go through some hardware troubleshooting again.
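The rule of thumb above (stay below 0x7ffffffff000, ignore the low bits) can be written down as a one-line check. The boundary constant is taken from this thread; the function name is made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Start of the top user page, per the addresses quoted in this thread. */
#define TOP_USER_PAGE UINT64_C(0x7ffffffff000)

/* True if the reported signal trampoline address lies below the top page. */
bool sigtramp_below_top_page(uint64_t addr)
{
    return addr < TOP_USER_PAGE;
}
```

Both 0x7fffffffe190 and 0x7fffffffe000 from the outputs above pass this check; an address of the form 0x7ffffffffxxx would not.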
Comment 236 Nils Beyer 2017-09-07 08:26:50 UTC
(In reply to Don Lewis from comment #235)

> Yeah, I'd get in touch with AMD.  I suspect that you'll have to go through some hardware troubleshooting again.

I haven't gotten any MCA messages yet with the new CPU - so it makes me believe that it's more a software problem now.

I'd rather bother AMD support again only if I'm strongly sure that there really is a hardware problem - and the MCA messages are one strong indicator for me.

Maybe the user page needs to be shifted even further down:

    https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/11ba7f73d6e534d54da55d5c4a1ac1553cc62b45

How do I do that on FreeBSD?
Comment 237 Don Lewis freebsd_committer freebsd_triage 2017-09-07 08:37:53 UTC
(In reply to Nils Beyer from comment #236)
We now leave the boundary page unmapped, which is supposed to cure the problem according to what AMD says.  That's why the "program to cause Ryzen hang/reboot on tweaked FreeBSD by executing code in high memory" is no longer able to map that page and use it for evil purposes.
Comment 238 Nils Beyer 2017-09-07 08:59:48 UTC
(In reply to Don Lewis from comment #237)

okay, just for fun I increased the dead zone of the user page:
--------------------------------------------------------------------------
Index: sys/amd64/amd64/elf_machdep.c
===================================================================
--- sys/amd64/amd64/elf_machdep.c       (revision 323186)
+++ sys/amd64/amd64/elf_machdep.c       (working copy)
@@ -88,10 +88,10 @@
 amd64_lower_shared_page(struct sysentvec *sv)
 {
        if (hw_lower_amd64_sharedpage != 0) {
-               sv->sv_maxuser -= PAGE_SIZE;
-               sv->sv_shared_page_base -= PAGE_SIZE;
-               sv->sv_usrstack -= PAGE_SIZE;
-               sv->sv_psstrings -= PAGE_SIZE;
+               sv->sv_maxuser -= (128 * PAGE_SIZE);
+               sv->sv_shared_page_base -= (128 * PAGE_SIZE);
+               sv->sv_usrstack -= (128 * PAGE_SIZE);
+               sv->sv_psstrings -= (128 * PAGE_SIZE);
        }
 }
[...]
root@asbach:/home/nbe/#./sigtramp
8
-1 12 0x7ffffff7f190 8
--------------------------------------------------------------------------

let's see if that changes anything...
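
As a sanity check on the arithmetic of the patch above, here is a small sketch. It assumes the 4096-byte amd64 page size and assumes that the unpatched shared page sits at 0x7ffffffff000 (one page below the user address limit); the constant and function names are made up for this example.

```python
PAGE_SIZE = 4096  # amd64 base page size

# Assumption: location of the shared page before any lowering is applied.
UNPATCHED_SHARED_PAGE = 0x7FFFFFFFF000


def lowered_base(pages):
    """Shared page base after moving it down by the given number of pages."""
    return UNPATCHED_SHARED_PAGE - pages * PAGE_SIZE


def page_of(addr):
    """Round an address down to the start of its page."""
    return addr & ~(PAGE_SIZE - 1)


print(hex(lowered_base(1)))    # 0x7fffffffe000
print(hex(lowered_base(128)))  # 0x7ffffff7f000
print(page_of(0x7FFFFFF7F190) == lowered_base(128))  # True
```

Under those assumptions, 128 pages is a 512 KiB dead zone, and the 0x7ffffff7f190 value printed by sigtramp falls in the page the modified code should produce.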
Comment 239 Ivan Rozhuk 2017-10-03 02:25:11 UTC
Any news?
Does Ryzen work OK now?
Comment 240 Nils Beyer 2017-10-03 12:50:04 UTC
(In reply to rozhuk.im from comment #239)

nope, still freezes with black screen and fans still running. Even without having poudriere builds running. Haven't contacted AMD tech support about that yet because I haven't gotten any MCA messages yet, so it probably is not a hardware issue anymore.

AGESA 1.0.0.6b doesn't help at all though.

Haven't played around with voltages/frequencies yet...
Comment 241 Ivan Rozhuk 2017-10-03 13:58:22 UTC
(In reply to Nils Beyer from comment #240)

Is your CPU newer than week 25?
Comment 242 Nils Beyer 2017-10-03 14:20:54 UTC
(In reply to rozhuk.im from comment #241)

"UA 1730SUS"
Comment 243 Don Lewis freebsd_committer freebsd_triage 2017-10-03 17:19:44 UTC
The shared page relocation (r321899 in HEAD) fixed the hanging/crashing problem for me, but I still had the random SIGBUS/SIGSEGV problem when running parallel compiles.  I RMAed my old CPU the Friday before last.  The replacement with date code 1733SUS was delivered yesterday.  I hope to install it later today.
Comment 244 Don Lewis freebsd_committer freebsd_triage 2017-10-03 17:26:08 UTC
My original CPU had a 1708SUT date code.
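
The date codes quoted in the last few comments can be compared against the week-25 cutoff asked about in comment 241. The sketch below assumes that the first four digits of a code are a two-digit year followed by a two-digit week (YYWW); that reading is an assumption, and the function names are made up for this example.

```python
import re


def parse_date_code(code):
    """Extract (year, week) from a code such as 'UA 1730SUS'.

    Assumption: four digits (YYWW) followed by three capital letters.
    """
    match = re.search(r"(\d{2})(\d{2})[A-Z]{3}", code)
    if match is None:
        raise ValueError("no date code found in %r" % code)
    return 2000 + int(match.group(1)), int(match.group(2))


def newer_than_week(code, year=2017, week=25):
    """True if the date code is later than the given year and week."""
    return parse_date_code(code) > (year, week)


for code in ("UA 1730SUS", "1733SUS", "1708SUT"):
    print(code, parse_date_code(code), newer_than_week(code))
```

With that reading, 1730SUS and 1733SUS fall after week 25 of 2017, while 1708SUT falls before it.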
Comment 245 Ivan Rozhuk 2017-10-03 19:23:14 UTC
What is your chipset date?

See after: 17A2.
Example: http://www.pcdiy.com.tw/assets/images/768/8a89458610…8c293db902ab.jpeg

Mine:
X370 Taichi - 1629 - Freeze/reboot
AB350M Pro4 - 1649 - OK
Fatal1ty X370 Gaming X - 1646 - OK
Comment 246 Don Lewis freebsd_committer freebsd_triage 2017-10-03 20:55:32 UTC
(In reply to Nils Beyer from comment #240)
Does the CPU core temperature look OK?
Comment 247 Nils Beyer 2017-10-05 10:28:59 UTC
(In reply to rozhuk.im from comment #245)

> What is your chipset date?

no idea.


> See after: 17A2.

what does "17A2" mean?


> Example: http://www.pcdiy.com.tw/assets/images/768/8a89458610…8c293db902ab.jpeg

link is broken/incomplete...
Comment 248 Nils Beyer 2017-10-05 10:30:18 UTC
(In reply to Don Lewis from comment #246)

yes, temperature is around 51°C - 52°C under load. A couple of degrees cooler than my old CPU...
Comment 249 Ivan Rozhuk 2017-10-05 11:06:55 UTC
(In reply to Nils Beyer from comment #247)

http://forum.ixbt.com/post.cgi?id=attach:9:68823:5110:1.jpg
(on photo my asrock taichi, and some sort of oil from thermal pad)

17A2 - some constant, date on next line.
Comment 250 SF 2017-10-06 10:02:04 UTC
While you were searching for a software fault, I found out what was causing it months ago and finally found a solution. You didn't even try it; no one did, even though I described it more than once.

It is caused by the power supply of the mainboard! My PC is not crashing anymore. All I did was set CPU load-line balancing to low, turn off the boost function, set optimized power phase control, set the CPU switching frequency to 350, disable EPU, add +0.12 to the CPU voltage and +0.12 to the SoC voltage, and avoid cranking up the RAM frequency too much. If it's still unstable, you have to reduce your CPU frequency to 3000 MHz or 2800 MHz. I am having absolutely no crashes anymore. The solution is very complicated, but it works, and it's exactly what I said from the beginning...
Comment 251 Ivan Rozhuk 2017-10-06 14:59:38 UTC
Well, I tried everything you describe:
- changed the power supply
- decreased the frequency
- turned off boost
- tuned LLC
- many other things in different combinations

But I found that removing the oil from the thermal pad around the chipset fixed the strange reboots, which happened even with BristolRidge and did not happen on my other AM4 motherboards with BristolRidge.
http://forum.asrock.com/forum_posts.asp?TID=4593&PID=37459&title=x370-taichi-goes-blank-code-00#37459

I remember some strange PCI-E behavior with the 1700X, but I RMAed it, and now with BristolRidge and a 1300X (1725) I don't see these issues.
I'm not sure whether it was because of the oil around the chipset or because something was wrong with the 1700X.
Comment 252 Don Lewis freebsd_committer freebsd_triage 2017-10-07 19:27:08 UTC
(In reply to SF from comment #250)
This mega-thread https://community.amd.com/thread/215773?start=0&tstart=0 on AMD Community Forum is full of Linux users who are experiencing random segfaults when doing parallel compiles.  Lots of experiments with different voltage settings, RAM timing settings, and tweaking of other BIOS knobs.

AMD eventually admitted that there is a "performance marginality" issue and has been doing warranty replacements for customers who run into this problem.  Sometimes they request that the customer perform some experiments with various voltage and other settings before approving the replacement, but I don't recall seeing any success stories from that.

AMD was apparently manually screening some of the replacement CPUs before shipping them, as evidenced by one of the seals on the replacement CPU being cut and traces of thermal compound on the CPU.  At least in some cases AMD performed testing with hardware identical to the customer's.

The system crashes and hangs that I and many other FreeBSD users experienced were caused by the behavior of the instruction prefetch hardware near the maximum possible user address 0x7fffffffffff.  This problem affected both FreeBSD and DragonflyBSD.  I don't know about the other BSDs.  We implemented an acceptable workaround in r321899.
Comment 253 Don Lewis freebsd_committer freebsd_triage 2017-10-07 19:32:28 UTC
(In reply to Nils Beyer from comment #248)
I finally had a chance to install my new CPU.  Temperatures under load are generally in the upper 40s as opposed to the lower 60s, but the weather has changed and maybe 8C of the difference is due to room temperature.

No crashes/hangs/panics, but I do still have some port build anomalies that I'll discuss in the other PR.
Comment 254 SF 2017-10-08 08:59:17 UTC
Buy a better cooler and ensure your system stays below 60°C under all circumstances. Watercooling should completely eliminate the misbehavior; as I said previously, everything I found out was correct. Yesterday I had a single crash, but the problem is almost gone; I think it was caused by the +voltage settings, which are unnecessary. All the settings I changed are meant to make sure that the mainboard isn't overheating.
Comment 255 Mark Millard 2017-10-08 10:44:42 UTC
(In reply to SF from comment #254)

Just reporting a distinct example. . .

The Ryzen 1800X system that I had access to
for a time had water-based cooling that kept
it under 47 degC even during hot summer days
with no air conditioning but doing builds.
(Clearly not true of any days getting near
that temperature in the first place. But it
gives a clue as to effectiveness.) The
power supply was also good. There was only
a basic, low power video card present.

That did not stop the SIGSEGV's, missing
temporary files, having lang/ghc almost
always fail to build, etc. Although, in
general, I seemed to get them somewhat
less often than some folks seemed to be
reporting --other than lang/ghc nearly
always failing fairly rapidly.

[Side question: Have you tested your
ability to repeatedly rebuild lang/ghc
from scratch? It was by far the most
reliable failure that I found (that
was a normal activity).]

But I never observed the panics in normal
operation, not even before the page-avoidance
change was made. However, I was using 64-bit
FreeBSD via Hyper-V with Windows 10 Pro as the
boot OS. That might have caused automatic
avoidance of the problematical page.
(I also did a little VirtualBox activity
instead of Hyper-V activity.)

I tended to assign FreeBSD access to
14 hw-threads in Hyper-V, leaving 2
avoided for Windows 10 Pro availability.

For the vast majority of things I'd left
the BIOS at defaults. For example,
no voltage changes of any kind --nor
CPU frequency changes. AMD Cool'n'Quiet
disabled, SVM Mode Enabled, Core
C6 state disabled. 2nd memory profile
used as-is. 2133 MT/s resulted. (It
was not overclocking-style RAM.) And
that is about it for adjustments.

I no longer have access to that system.
So that has limited what I managed to
explore. It will probably be a month or
so before I've access to a Ryzen system
of some kind again. (A BIOS update
failure also cut into the time I had
on the Ryzen system: motherboard
replacement.)
Comment 256 SF 2017-10-08 11:25:24 UTC
None of you tried changing the CPU switching frequency or setting load-line balancing. I tried a lot of things with cooling and all kinds of power settings; cooling was the first thing where I noticed significant improvements, like other people did. The second thing was the CPU switching frequency; load-line balancing also affects it marginally. You will notice the difference when switching from low to extreme. Increasing the SoC voltage improved it marginally, but CPU voltage always seemed to worsen it. Deactivating boost was a huge improvement, and reducing the CPU frequency (not the switching frequency) stabilized it completely. I don't know what's wrong with you people, but I said previously that there was someone with two identical Ryzen systems; one system crashed and the other didn't. One system had watercooling and the other had air cooling; the system with air cooling crashed. I have read of people who said that low-budget motherboards keep crashing with Ryzen systems because their power supply is faulty, up to and including X370 chipsets.
Comment 257 Mark Millard 2017-10-08 12:07:50 UTC
(In reply to SF from comment #256)

My guess is that "system crashed" in
"one system crashed and the other didn't" is
referring to panics and shutdowns but not
to individual programs that get SIGSEGV
or other such per-process behavior.

If that is right: I never had the problem
to begin with. I would have had to change
things to cause the problem before I could
get rid of it. I had no reason to want to
create a problem that I did not have.

Basically my context was not a good test
case for such. (And I ran out of time
with the system anyway.)


I'd still like to learn how many times
in a row you can rebuild lang/ghc
without any per-process failures. (Not
intending more than a few such tries
if it seems reliable about building.)

Of course you may not want to do or
report such. But I'd be curious if
you did.
Comment 258 Don Lewis freebsd_committer freebsd_triage 2017-10-09 06:01:03 UTC
(In reply to Mark Millard from comment #255)
I never got panics.  Instead, my machine would either blank the video (text console, not running Xorg) and hang, or would silently reboot.  Building openjdk7 was a fairly reliable trigger.  Moving the shared page was a 100% fix for that problem.

I still don't fully understand the problem.  The signal trampoline location should have been far enough away from the boundary for the instruction prefetcher to stop before hitting the boundary ...
Comment 259 Don Lewis freebsd_committer freebsd_triage 2017-10-09 06:35:31 UTC
(In reply to SF from comment #256)
Neither of my AM4 boards have a VRM frequency adjustment, and none of my large collection of non-AM4 boards have it either.  I think this feature is pretty rare.

The highest temperature that I observed in my testing was about 62 C, and that was  only on very hot afternoons in an un-airconditioned room.  We only recently got temperature monitoring working in FreeBSD for Ryzen, so I don't know what the CPU temperature was in my early testing, but the room temperature was probably 10C lower on my overnight tests and it didn't seem to make any difference.  Disabling all but two cores in the BIOS also didn't make the errors go away.  That should have reduced power consumption and heat dissipation to something like 25W.  Reducing the CPU and RAM clock frequencies also did not help.  Forcing the cooling fans to run at full speed full time also did not help.  The default fan curve never cranked up the fan speed this high.  This doesn't look like a thermal or voltage regulation issue to me.

The only thing that really seemed to improve the results that I was seeing was tweaking the scheduler to limit the migration of threads between cores, and the effect was not at all subtle.

The AMD Community Forum thread that I cited has posts from a large number of Linux users who were experiencing the random segfault problem.  Many of them worked with AMD customer support who suggested trying a number of different things (mostly voltage tweaks, disabling SMT, disabling OPCACHE, etc.) that really didn't seem to solve the problem.  At best they reduced the frequency of the errors.  AMD does now say that there is a "performance marginality" issue and has been doing warranty replacements of CPUs for users who have this problem and generally people who have gotten replacement CPUs have been happy with the results.  I don't think AMD would be spending the money to do this if the problem could be fixed with a motherboard BIOS upgrade that would tweak the default VRM settings.  Apparently AMD is now able to screen for this problem because they also stated that Threadripper is not affected and it uses two of the Ryzen die (with the same stepping as the Ryzen CPU chips).

In my case, I just received a warranty CPU replacement.  The random compiler segfaults are now gone.  The only info that I had to send AMD was my CPU part and serial numbers, a description of my hardware (PSU, RAM, motherboard, BIOS revision, etc.), a photo of the BIOS screen showing voltages and temperatures, and a photo of my case interior so they could look for any potential cooling problems.  Based on that, they approved an RMA and sent me a replacement CPU.  It doesn't look like they thought that any BIOS tuning tweaks would be worth trying.  I still see some random build failures, but I see the same sorts of failures on my AMD FX-8320E.
Comment 260 Don Lewis freebsd_committer freebsd_triage 2017-10-09 06:39:24 UTC
(In reply to Mark Millard from comment #257)
I'm not convinced that the ghc problem is entirely a Ryzen hardware problem.

The first thing I did after I received my replacement CPU was to rerun poudriere, and ghc failed as usual.  Then there was a recent commit that I wanted to test, so I upgraded to the latest version of 12.0-CURRENT.  I've run poudriere twice since then, and ghc has successfully built both times.

I'm planning on doing some bisection tests to try to track down what seems to have fixed the problem, but this could take a while.
Comment 261 SF 2017-10-09 14:25:31 UTC
The difference between low-budget motherboards and enthusiast boards is the switching frequency and a better power supply in general; the cheap boards only have a low frequency, and manual settings don't get you much higher. The frequency on some high-cost boards was more than double that of all the boards up to the X370 chipset. This is where you will see the most difference in stabilizing your system. Cooling also affects it a lot. Disabling cores will do nothing; I have programs that crash running on only a single core, and I have programs that never crash the system running on all cores. Disabling boost and reducing the CPU frequency is the only thing that helps.
Comment 262 SF 2017-10-09 14:44:20 UTC
Will building of ghc stop if there is any error? I just put it into a loop.
Comment 263 Lars Viklund 2017-10-09 14:45:53 UTC
Let me go on a tangent with my experience since April of Ryzen:

I've had both kinds of AMD problem with my Ryzen 1700 across two ASUS motherboards, the PRIME B350M-A and the PRIME X370-PRO.

It may help to clear up the discussion to note that there are indeed distinct problems that AMD seems to address when you contact their RMA/support.

The first kind is the popular one, with segfaulting GCC builds. That was resolved for me by going from a 1709PGT to a 1728SUS CPU via an RMA process, in which the AMD person considered the stock cooler and case organization sufficient.

The second kind is the one where the machine simply freezes after an hour or more, not traceable to load patterns.

The symptoms for that have been described earlier in this bug and are that all activity ceases in the machine - NIC stops, screen turns off, no trace of any panics over display or network. Only thing alive is fans and the RGB lighting.

In my case, the only knob in firmware that had any effect was disabling SMT completely (note, not restricting core pairs). This changed the machine from hanging after sinking around 2.2T of data onto ZFS over 10 gigabit networking (mlx4en) in about 1h20min to being usable as a stable NAS for weeks.

In my second interaction with AMD, they directed me to disable all forms of power saving, particularly C-states like C6. As my firmware doesn't have that particular knob anymore, I could not fully comply and the machine remains unstable in SMT configuration with FreeBSD. For SF's sake, no amount of load levelling, poll frequency, voltage bumps, or other knobs have any effect, to the degree that they make sense to tune. While horrible VRM evidently may be one cause for problems, this seems to be a wider issue.

I should note that I'm running the commit that moves the top page of memory, and have also modified it to have a larger gap, as others have done in this bug, and the sigtramp tool reports that I seem to have applied it properly. Instability remains.

Now, to the interesting part of the story... a bleeding edge Arch Linux (kernel 4.13.3) is rock solid on the same hardware and same configuration. 

The only aberration that it demonstrates is kernel log entries that AMD's IOMMU may not quite be up to snuff, which I've ignored as the machine seems to work and I only need the GPU for console.

> [Thu Oct  5 23:04:13 2017] nouveau 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x0000000000000000 flags=0x0000]
> [Thu Oct  5 23:04:13 2017] nouveau 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x0000000000000080 flags=0x0000]
> [Thu Oct  5 23:04:13 2017] nouveau 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x0000000000000180 flags=0x0000]

Could it be an interaction with the NVidia card or other hardware in a non-primary PCIe slot? Could it be that the IOMMU on these chipsets is so broken that the OS needs workarounds or ignoring it completely?

I've got no clue, but I offer this data and these points to ponder.
Comment 264 SF 2017-10-09 15:02:44 UTC
No, read my recent posts. That's the error I have with my machine: the screen turning off and only some stuff inside the PC still running. I simply call it a crash; what happens while the system is running is occasional short freezes and other kinds of disturbances. You can solve this by doing what I described in my posts.
Comment 265 Lars Viklund 2017-10-09 15:31:37 UTC
(In reply to SF from comment #264)
Dear SF,
I've read every single comment in this bug, and as I say (if you read my thread) I've tried adjusting settings both according to AMD's advice and yours.

There is no boost, and there is no powerd, and running FOUR different brands of memory at speeds between 2133 and 3200 have no measurable change in results.

I don't doubt that you've stabilized your machine. I doubt that your technique is as universally applicable as you claim it to be.

While I know you mean well, the way you interact with the community makes everyone, including me, just want to go away.

In my case, I've got the Linux escape hatch, but I'd rather drill deep and figure out if there's something missing on the FreeBSD side of things.

If this is out of line and your behaviour is fine, I'll be happy to be told by actual FreeBSD people to stop, and I'll go away.

I'm writing this in the belief that you're inadvertently hurting this investigation, and I hope you take this to heart to consider that your theory may not be complete.
Comment 266 SF 2017-10-09 15:43:21 UTC
Nvm, I quit this bug a while ago because, from my point of view, you people are weak-minded. The solution I gave you is exactly the solution for all the kinds of problems people have been suffering so far; as I told you before, I am a hardware developer and I know what I am talking about. It makes no sense for me to keep talking to people like you, as always; I don't need people like you to solve my problems. Keep making yourselves ridiculous.
Comment 267 Mark Millard 2017-10-09 15:56:47 UTC
(In reply to SF from comment #262)

> Will building of ghc stop if there is any error? I just put it into a loop.

A ghc build will stop with something like a bus error
when it fails.

So if your loop structure notices and stops as well
then your overall process should stop as well.

It would be important that the prior build be
cleaned out before the next build (not reused).

It would also be appropriate to know what
version of FreeBSD the test is run under.
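
A minimal sketch of such a loop, following the points above (clean before every build, stop at the first failure). It assumes the default ports tree location /usr/ports/lang/ghc and drives the plain ports "clean" and "build" targets rather than poudriere; the function names are made up for this example.

```python
import subprocess

# Assumption: the ports tree is checked out in the default location.
PORT_DIR = "/usr/ports/lang/ghc"


def run_make(target):
    """Run one make target in the port directory and return its exit status."""
    return subprocess.run(["make", "-C", PORT_DIR, target]).returncode


def rebuild_until_failure(max_runs, run=run_make):
    """Clean and rebuild up to max_runs times.

    Stops at the first non-zero exit status and returns the number of
    builds that completed successfully before that point.
    """
    for completed in range(max_runs):
        # Clean first so that no state from the previous build is reused.
        if run("clean") != 0 or run("build") != 0:
            return completed
    return max_runs
```

Calling rebuild_until_failure(5) would attempt five from-scratch builds and report how many succeeded; recording the output of uname -a alongside the result covers the FreeBSD version question.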
Comment 268 Mark Millard 2017-10-09 16:57:56 UTC
Just an FYI example about the hangups:

I'm aware of one example of someone with a
Ryzen 7 that was not having any of the hangup
problems until a BIOS update problem caused
a motherboard replacement (with the same
type of motherboard).

Same CPU, same low end video card, same memory,
same power supply, same pair of M.2 NVMe SSDs,
nothing else added or removed: just the
motherboard change.

And the result was a fairly rare example of the
hangup problem: screen goes blank, lights on
usb keyboard and mouse go off, ethernet
communication stops. But the motherboard
lights keep on updating like normal. Examples
include it happening while the system is
basically idle from a user point of view.

The context was a Windows 10 Pro installation,
not FreeBSD.
Comment 269 SF 2017-10-09 17:59:14 UTC
Isn't it exactly what I said? The motherboard is causing this.
Comment 270 Mark Millard 2017-10-09 19:13:04 UTC
(In reply to SF from comment #269)

> Isn't it exactly what I said? The motherboard is causing this.

Assuming that is a reply to my comment 268:

It was just example evidence with a simple context
with little opportunity for confounding issues
to be involved.

It does certainly support the expectation that some
aspect(s) of the motherboard make the difference
for the behavior. Strongly so in my view.

So, in effect, I gave evidence supporting that view.

It is not detailed enough material to provide
evidence about what aspects of the motherboard
made the difference. So it does not contradict
any more detailed material about how
motherboards can contribute to the behavior.
Comment 271 SF 2017-10-09 21:08:09 UTC
I doubt you truly read any of my posts. All the details are in there, leading to the conclusion that the power supply of at least all low-budget mainboards up to the X370 chipset for Ryzen systems is faulty.
Comment 272 Ivan Rozhuk 2017-10-09 22:59:59 UTC
(In reply to Lars Viklund from comment #263)

I had a similar issue with black screen/reboot after disk activity on the ASRock Taichi, even with a BristolRidge CPU. (The Gaming X and AB350M Pro4 were not affected.)
After I changed the thermal pad on the chipset and cleaned off (!!!) all the "oil" around the chipset, everything looks fixed.
I did the same on the VRM, but later, after tests showed no more reboots/black screens under disk+network activity.
The ASRock Gaming X had the same "oil"-leaking thermal pads.
The ASRock AB350M Pro4 does not have a thermal pad on the chipset, and its VRM thermal pad looks OK, but I changed it anyway (1mm -> 0,5mm).

In my case I used rsync to upload a few files (120 GB total), and the system rebooted/hung after 30-50 GB (the system has 32GB RAM).

Maybe there were some PCI errors that FreeBSD can't or doesn't handle, and maybe Linux handles them somehow in recent versions. I did not test Linux.
Comment 273 SF 2017-10-10 08:00:11 UTC
I've read of people who fixed it by doing similar things to what you did; someone printed himself a case for an air cooler with a 3D printer to improve cooling on his Ryzen motherboard.
Comment 276 SF 2018-11-26 13:51:45 UTC
Do you use ipfw? It's caused by ipfw; after removing some lines within ipfw it doesn't crash anymore. There are some specific commands you shouldn't use.

e.g.:
don't use "ipfw table" commands

https://forums.freebsd.org/threads/ipfw-kernel-panic-solution.65907/#post-388275
Comment 294 SF 2021-06-29 13:04:56 UTC
Since 2017 I have found out that the cause seemed to be simply a missing temperature sensor value: 60°C+ seemed to trigger it for me.

My Ryzen has been surprisingly bug-free since 2017; no need to buy a new one.