With Enhanced REP MOVSB/STOSB enabled in firmware settings on an AMD Zen 3 system, FreeBSD panics pretty quickly, e.g. (yeah, started a scrub because I initially thought the panics might've been related to my zfs pool)

[152] Fatal trap 9: general protection fault while in kernel mode
[152] cpuid = 6; apic id = 06
[152] instruction pointer = 0x20:0xffffffff807a4380
[152] stack pointer = 0x28:0xfffffe010e9af250
[152] frame pointer = 0x28:0xfffffe010e9af250
[152] code segment = base rx0, limit 0xfffff, type 0x1b
[152] = DPL 0, pres 1, long 1, def32 0, gran 1
[152] processor eflags = interrupt enabled, resume, IOPL = 0
[152] current process = 0 (z_rd_int_3)
[152] trap number = 9
[152] panic: general protection fault
[152] cpuid = 6
[152] time = 1607620126
[152] KDB: stack backtrace:
[152] db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe010e9aef60
[152] vpanic() at vpanic+0x181/frame 0xfffffe010e9aefb0
[152] panic() at panic+0x43/frame 0xfffffe010e9af010
[152] trap_fatal() at trap_fatal+0x387/frame 0xfffffe010e9af070
[152] trap() at trap+0x8e/frame 0xfffffe010e9af180
[152] calltrap() at calltrap+0x8/frame 0xfffffe010e9af180
[152] --- trap 0x9, rip = 0xffffffff807a4380, rsp = 0xfffffe010e9af250, rbp = 0xfffffe010e9af250 ---
[152] memcpy_erms() at memcpy_erms+0x130/frame 0xfffffe010e9af250
[152] abd_copy_off_cb() at abd_copy_off_cb+0x9/frame 0xfffffe010e9af260
[152] abd_iterate_func2() at abd_iterate_func2+0x27c/frame 0xfffffe010e9af340
[152] abd_copy_off() at abd_copy_off+0x1c/frame 0xfffffe010e9af360
[152] vdev_mirror_scrub_done() at vdev_mirror_scrub_done+0xd3/frame 0xfffffe010e9af3b0

kgdb "helpfully" points at support.S:577 which is just the whole macro invocation `MEMMOVE erms=1 overlap=0 begin=MEMMOVE_BEGIN end=MEMMOVE_END` :/
The panicking instruction is rep movsb %ds:(%rsi),%es:(%rdi)

So to be clear, there are no issues if you disable the extension? This will give a way forward.
Do you have a crashdump? A register dump would be very useful here. If not, and you can repro the crash and drop to the online debugger (ddb), you can run 'show reg'.
Moreover, does this start working with ERMS with: https://people.freebsd.org/~mjg/memcpy-no-align.diff
(In reply to Mateusz Guzik from comment #2)
> So to be clear, there are no issues if you disable the extension?
Yep. Over an hour of uptime without it so far, including heavy work like compiling Firefox.
> Do you have a crashdump?
I did mention kgdb in the original post :)
> register dump would be very useful here
(kgdb) info registers
rax            0x10200040002009e    72620561171349662
rbx            0xfffffe010e9af201   -2194483252735
rcx            0xffe                4094
rdx            0xffe                4094
rsi            0xfffffe03c6999002   -2182806401022
rdi            0x1020004000200a0    72620561171349664
rbp            0xfffffe010e9af250   0xfffffe010e9af250
rsp            0xfffffe010e9af250   0xfffffe010e9af250
r8             0x992100170305abb4   -7412643412840174668
r9             0x3210002f0104507    225461468962768135
r10            0x10200040002009e    72620561171349662
r11            0x109ff63741b1e5     4679479465193957
r12            0x1000               4096
r13            0xfffffe010e9af2a8   -2194483252568
r14            0x0                  0
r15            0xfffffe010e9af278   -2194483252616
rip            0xffffffff807a4380   0xffffffff807a4380 <memcpy_erms+304>
eflags         0x10293              [ CF AF SF IF RF ]
cs             0x20                 32
ss             0x28                 40
ds             <unavailable>
es             <unavailable>
fs             <unavailable>
gs             <unavailable>
fs_base        <unavailable>
gs_base        <unavailable>
> does this start working with ERMS with
Going to test now.
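As a cross-check on the dump, the raw `eflags` value 0x10293 can be decoded by hand and compared with both the kgdb annotation and the "interrupt enabled, resume, IOPL = 0" line in the panic message. The following is only a minimal sketch: `decode_eflags` and the `eflags_names` table are hypothetical helpers written for illustration, not kgdb or FreeBSD kernel code, and only the commonly shown flag bits are listed.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bit positions follow the x86 EFLAGS layout; bit 1 is reserved (always 1). */
struct flag_name {
	uint32_t mask;
	const char *name;
};

static const struct flag_name eflags_names[] = {
	{ 1u << 0,  "CF" }, { 1u << 2,  "PF" }, { 1u << 4,  "AF" },
	{ 1u << 6,  "ZF" }, { 1u << 7,  "SF" }, { 1u << 8,  "TF" },
	{ 1u << 9,  "IF" }, { 1u << 10, "DF" }, { 1u << 11, "OF" },
	{ 1u << 16, "RF" },
};

/* Hypothetical helper: write the names of the set flags, space-separated. */
static void
decode_eflags(uint32_t eflags, char *buf, size_t buflen)
{
	size_t used = 0;

	if (buflen == 0)
		return;
	buf[0] = '\0';
	for (size_t i = 0; i < sizeof(eflags_names) / sizeof(eflags_names[0]); i++) {
		if ((eflags & eflags_names[i].mask) == 0)
			continue;
		int n = snprintf(buf + used, buflen - used, "%s%s",
		    used ? " " : "", eflags_names[i].name);
		if (n < 0 || (size_t)n >= buflen - used)
			break;	/* stop on error or truncated output */
		used += (size_t)n;
	}
}

/* IOPL lives in bits 12-13. */
static unsigned
eflags_iopl(uint32_t eflags)
{
	return (eflags >> 12) & 3u;
}
```

Calling `decode_eflags(0x10293, buf, sizeof(buf))` yields the same flag set kgdb prints, with IF and RF corresponding to "interrupt enabled" and "resume" in the trap output.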
(In reply to Greg V from comment #4)
Can you look at the abd_copy_off_cb() frame and infer the arguments used for the call to memcpy()? %rdi = 0x1020004000200a0, which is a non-canonical address, is clearly the issue, and I do not believe that the memcpy asm can result in such mangling.
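To make the "non-canonical" point concrete: on amd64 with 48-bit virtual addresses, bits 63..47 of a pointer must all be equal, and an access through an address that violates this raises a general protection fault. The sketch below assumes 48-bit addressing (4-level paging); `is_canonical` is a hypothetical helper for illustration, not a FreeBSD kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48-bit virtual addresses, so bit 47 is the sign bit. */
#define VA_BITS 48

/*
 * Hypothetical helper: an address is canonical when bits 63..47 are
 * either all zeros (lower half) or all ones (upper half).
 */
static bool
is_canonical(uint64_t addr)
{
	uint64_t top = addr >> (VA_BITS - 1);	/* the 17 bits 63..47 */

	return (top == 0 || top == 0x1ffff);
}
```

With the values from the register dump, `is_canonical(0x1020004000200a0)` (the destination in %rdi) is false, while `is_canonical(0xfffffe03c6999002)` (the source in %rsi) is true, which fits a bad destination pointer being passed in rather than a fault in the copy itself.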
Relevant regs are messed up from the get go. The panicking case handles buffers misaligned up to 16 bytes and saves the target before doing anything. The saved address is clearly bogus. Just in case, are you running with newest microcode for the cpu?
(In reply to Konstantin Belousov from comment #5)
Yeah, 0x1020004000200a0 is in rdi at the abd_copy_off_cb frame.

Also, I've now seen similar panics with memcpy_std, and a completely different panic (not memcpy) in zfs when testing erms with the patch above… So this is probably actually ZFS's fault. Now trying a couple months old kernel, zfs scrub is working fine with ERMS on. Sorry for the false alarm I guess. But this did not happen before the CPU upgrade?? Weird.
> are you running with newest microcode for the cpu?
The CPU is so new, idk if any microcode updates were released yet. But I am running the latest firmware for the mainboard, literally 2 days old "BIOS".
Closing as there's no ERMS bug, but it seems that there is *some* bug causing wild instability on these CPUs that was added in the last couple months.

Testing log:
- turned off all overclocking (even XMP, RAM is 2400MHz now) just to be sure, turned off ERMS, basically running stock firmware settings
- everything seems perfectly stable on a kernel from September
- rebuilt current kernel cleanly (erasing /usr/obj and ccache), unstable
- rebuilt it with llvm10, unstable
- rebuilt it without CPUTYPE, unstable
- rebuilt the exact git revision of the September kernel, fine!
There were quite a few changes to zfs. Since the bug is reproducible I think it would be best if you bisect.
(In reply to Mateusz Guzik from comment #9) Rather than bisecting, I was rebasing my local customizations onto ZFS related commits. Looks like the culprit is "Update OpenZFS to 2.0.0-rc3-gfc5966"! Interestingly I've also found https://github.com/openzfs/zfs/issues/11108 << a recent issue mentioning non-canonical addresses (seems like Linux detects that when handling faults), even if in a different context. I'll report this to OpenZFS.
Although even on a commit before that update I've managed to lock the system up by doing git rebase while scrubbing (processes that unlink get stuck in tx_quiesce_done_cv) :/ but that might be a separate problem from the one that causes very quick crashes..?
Now testing upstream OpenZFS:
- 2.0.0-rc1 + compat fixes (cherry-pick 01a65c58 f6bb7c029) = stable
- 2.0.0-rc2 + INVARIANTS + compat fixes = FPU crash
- 2.0.0-rc2 + INVARIANTS + compat fixes + FPU fix (e0716250bf) = unstable
- 2.0.0-rc7 + INVARIANTS + FPU fix = stable
- master + INVARIANTS = stable
- master = stable

(my stability test being running rm -rf /usr/obj/usr and buildkernel a few times while zpool scrub is going on)

Conclusion: seems like that problem has been fixed upstream. Whatever it was. I'm not seeing anything obvious in the commit log o_0
(In reply to Greg V from comment #12)
What is 'FPU fix (e0716250bf)'? I am unable to find the hash in the openzfs repo. I am sure nothing like 'fpu fix' was committed to freebsd.
(In reply to Konstantin Belousov from comment #13) https://github.com/openzfs/zfs/commit/e0716250bf34159082d96f04200063e3edb114b6 It was committed yesterday. Apparently authored by you :)