Bug 251736 - [panic] memcpy_erms general protection fault on AMD Zen 3 (Ryzen 5000 series)
Status: Closed Not A Bug
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2020-12-10 17:58 UTC by Val Packett
Modified: 2022-10-12 00:49 UTC

See Also:


Description Val Packett 2020-12-10 17:58:27 UTC
With Enhanced REP MOVSB/STOSB enabled in firmware settings on an AMD Zen 3 system, FreeBSD panics pretty quickly, e.g. during a ZFS scrub (yeah, I started a scrub because I initially thought the panics might've been related to my zfs pool):

[152] Fatal trap 9: general protection fault while in kernel mode
[152] cpuid = 6; apic id = 06
[152] instruction pointer       = 0x20:0xffffffff807a4380
[152] stack pointer             = 0x28:0xfffffe010e9af250
[152] frame pointer             = 0x28:0xfffffe010e9af250
[152] code segment              = base rx0, limit 0xfffff, type 0x1b
[152]                   = DPL 0, pres 1, long 1, def32 0, gran 1
[152] processor eflags  = interrupt enabled, resume, IOPL = 0
[152] current process           = 0 (z_rd_int_3)
[152] trap number               = 9
[152] panic: general protection fault
[152] cpuid = 6
[152] time = 1607620126
[152] KDB: stack backtrace:
[152] db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe010e9aef60
[152] vpanic() at vpanic+0x181/frame 0xfffffe010e9aefb0
[152] panic() at panic+0x43/frame 0xfffffe010e9af010
[152] trap_fatal() at trap_fatal+0x387/frame 0xfffffe010e9af070
[152] trap() at trap+0x8e/frame 0xfffffe010e9af180
[152] calltrap() at calltrap+0x8/frame 0xfffffe010e9af180
[152] --- trap 0x9, rip = 0xffffffff807a4380, rsp = 0xfffffe010e9af250, rbp = 0xfffffe010e9af250 ---
[152] memcpy_erms() at memcpy_erms+0x130/frame 0xfffffe010e9af250
[152] abd_copy_off_cb() at abd_copy_off_cb+0x9/frame 0xfffffe010e9af260
[152] abd_iterate_func2() at abd_iterate_func2+0x27c/frame 0xfffffe010e9af340
[152] abd_copy_off() at abd_copy_off+0x1c/frame 0xfffffe010e9af360
[152] vdev_mirror_scrub_done() at vdev_mirror_scrub_done+0xd3/frame 0xfffffe010e9af3b0

kgdb "helpfully" points at support.S:577 which is just the whole macro invocation `MEMMOVE erms=1 overlap=0 begin=MEMMOVE_BEGIN end=MEMMOVE_END` :/
Comment 1 Mateusz Guzik freebsd_committer freebsd_triage 2020-12-10 18:05:11 UTC
The panicking instruction is rep movsb %ds:(%rsi),%es:(%rdi)

So to be clear, there are no issues if you disable the extension? This will give a way forward.
Comment 2 Mateusz Guzik freebsd_committer freebsd_triage 2020-12-10 18:16:44 UTC
Do you have a crashdump? A register dump would be very useful here.

If not, and you can reproduce the crash and drop into the online debugger (ddb), you can run 'show reg'.
Comment 3 Mateusz Guzik freebsd_committer freebsd_triage 2020-12-10 18:29:48 UTC
Moreover, does it start working with ERMS enabled once this patch is applied: https://people.freebsd.org/~mjg/memcpy-no-align.diff
Comment 4 Val Packett 2020-12-10 19:20:27 UTC
(In reply to Mateusz Guzik from comment #2)

> So to be clear, there are no issues if you disable the extension?

Yep. Over an hour of uptime without it so far, including heavy work like compiling Firefox.

> Do you have a crashdump?

I did mention kgdb in the original post :)

> register dump would be very useful here

(kgdb) info registers
rax            0x10200040002009e   72620561171349662
rbx            0xfffffe010e9af201  -2194483252735
rcx            0xffe               4094
rdx            0xffe               4094
rsi            0xfffffe03c6999002  -2182806401022
rdi            0x1020004000200a0   72620561171349664
rbp            0xfffffe010e9af250  0xfffffe010e9af250
rsp            0xfffffe010e9af250  0xfffffe010e9af250
r8             0x992100170305abb4  -7412643412840174668
r9             0x3210002f0104507   225461468962768135
r10            0x10200040002009e   72620561171349662
r11            0x109ff63741b1e5    4679479465193957
r12            0x1000              4096
r13            0xfffffe010e9af2a8  -2194483252568
r14            0x0                 0
r15            0xfffffe010e9af278  -2194483252616
rip            0xffffffff807a4380  0xffffffff807a4380 <memcpy_erms+304>
eflags         0x10293             [ CF AF SF IF RF ]
cs             0x20                32
ss             0x28                40
ds             <unavailable>
es             <unavailable>
fs             <unavailable>
gs             <unavailable>
fs_base        <unavailable>
gs_base        <unavailable>

> does this start working with ERMS with

Going to test now.
Comment 5 Konstantin Belousov freebsd_committer freebsd_triage 2020-12-10 20:00:32 UTC
(In reply to Greg V from comment #4)
Can you look at the abd_copy_off_cb() frame and infer the arguments used for the call
to memcpy()?
The %rdi value 0x1020004000200a0, which is a non-canonical address, is clearly the issue,
and I do not believe that the memcpy asm can result in such mangling.
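
For context on why 0x1020004000200a0 faults: on amd64 with 48-bit virtual addresses (an assumption here; systems with 5-level paging use 57 bits), an address is canonical only if bits 63 through 47 are all equal. Below is a minimal sketch of that check in C. The helper name is_canonical48 is hypothetical and is not a FreeBSD kernel function; the two values come from the %rsi and %rdi entries of the register dump in comment 4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the FreeBSD kernel.
 * Assumes 48-bit virtual addresses: bits 63..47 must be all 0 or all 1.
 */
bool is_canonical48(uint64_t va)
{
    uint64_t top = va >> 47;            /* the 17 bits 63..47 */
    return top == 0 || top == 0x1ffff;
}

/* Usage example with values copied from the register dump in comment 4. */
void check_comment4_registers(void)
{
    /* %rsi (source): upper bits all ones, so canonical. */
    assert(is_canonical48(0xfffffe03c6999002ULL));
    /* %rdi (destination): upper 16 bits are 0x0102, so non-canonical. */
    assert(!is_canonical48(0x01020004000200a0ULL));
}
```

A memory access through a non-canonical address raises a general protection fault rather than a page fault, which matches the "Fatal trap 9" in the description.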
Comment 6 Mateusz Guzik freebsd_committer freebsd_triage 2020-12-10 20:10:17 UTC
Relevant regs are messed up from the get go. The panicking case handles buffers misaligned up to 16 bytes and saves the target before doing anything. The saved address is clearly bogus.

Just in case, are you running with newest microcode for the cpu?
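
The register dump in comment 4 is consistent with an alignment step of the kind described above. The sketch below is a hypothetical C model, not the actual assembly in support.S: it assumes the first 16 bytes are copied separately and the bulk copy then starts at the next 16-byte boundary of the destination. The original source address 0xfffffe03c6999000 and length 0x1000 are inferred from the dump (%r12 holds 0x1000), not stated in the thread. The name align_dst16 is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Register state for the bulk "rep movsb" in this hypothetical model. */
struct copy_regs {
    uint64_t dst;    /* %rdi */
    uint64_t src;    /* %rsi */
    uint64_t count;  /* %rcx */
};

/*
 * Hypothetical model: advance dst to the next 16-byte boundary and
 * adjust src and the byte count by the same amount. Intended only for
 * a misaligned dst (dst & 15 != 0) and len >= 16.
 */
struct copy_regs align_dst16(uint64_t dst, uint64_t src, uint64_t len)
{
    uint64_t misalign = dst & 15;      /* bytes past the previous boundary */
    uint64_t skip = 16 - misalign;     /* bytes up to the next boundary */
    struct copy_regs r = { dst + skip, src + skip, len - skip };
    return r;
}

/* Usage example: the saved target from %r10 reproduces the dumped values. */
void check_alignment_against_dump(void)
{
    struct copy_regs r = align_dst16(0x010200040002009eULL,
                                     0xfffffe03c6999000ULL, 0x1000);
    assert(r.dst == 0x01020004000200a0ULL);   /* %rdi in the dump */
    assert(r.src == 0xfffffe03c6999002ULL);   /* %rsi in the dump */
    assert(r.count == 0xffe);                 /* %rcx in the dump */
}
```

Under these assumptions the saved target in %r10 (0x10200040002009e) was already non-canonical before any adjustment, which supports the reading that the bad pointer was passed in by the caller.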
Comment 7 Val Packett 2020-12-10 20:17:24 UTC
(In reply to Konstantin Belousov from comment #5)

Yeah 0x1020004000200a0 is in rdi at the abd_copy_off_cb frame.

Also, I've now seen similar panics with memcpy_std, and a completely different panic (not in memcpy) in ZFS when testing ERMS with the patch above…

So this is probably actually ZFS's fault.
Now trying a kernel from a couple of months ago: zfs scrub is working fine with ERMS on.
Sorry for the false alarm I guess.
But this did not happen before the CPU upgrade?? Weird.

> are you running with newest microcode for the cpu?

The CPU is so new, idk if any microcode updates were released yet.
But I am running the latest firmware for the mainboard, literally 2 days old "BIOS".
Comment 8 Val Packett 2020-12-10 22:32:00 UTC
Closing as there's no ERMS bug, but it seems that there is *some* bug causing wild instability on these CPUs that was introduced in the last couple of months.

Testing log:

- turned off all overclocking (even XMP, RAM is 2400MHz now) just to be sure, turned off ERMS, basically running stock firmware settings
- everything seems perfectly stable on a kernel from September
- rebuilt current kernel cleanly (erasing /usr/obj and ccache), unstable
- rebuilt it with llvm10, unstable
- rebuilt it without CPUTYPE, unstable
- rebuilt the exact git revision of the September kernel, fine!
Comment 9 Mateusz Guzik freebsd_committer freebsd_triage 2020-12-10 22:46:00 UTC
There were quite a few changes to zfs.

Since the bug is reproducible I think it would be best if you bisect.
Comment 10 Val Packett 2020-12-10 23:24:15 UTC
(In reply to Mateusz Guzik from comment #9)

Rather than bisecting, I rebased my local customizations onto the ZFS-related commits one by one.

Looks like the culprit is "Update OpenZFS to 2.0.0-rc3-gfc5966"!

Interestingly I've also found https://github.com/openzfs/zfs/issues/11108 << a recent issue mentioning non-canonical addresses (seems like Linux detects that when handling faults), even if in a different context.

I'll report this to OpenZFS.
Comment 11 Val Packett 2020-12-10 23:28:30 UTC
Although, even on a commit before that update, I've managed to lock the system up by doing a git rebase while scrubbing (processes that unlink get stuck in tx_quiesce_done_cv) :/ But that might be a separate problem from the one that causes the very quick crashes..?
Comment 12 Val Packett 2020-12-11 12:37:37 UTC
Now testing upstream OpenZFS:

- 2.0.0-rc1 + compat fixes (cherry-pick 01a65c58 f6bb7c029) = stable
- 2.0.0-rc2 + INVARIANTS + compat fixes = FPU crash
- 2.0.0-rc2 + INVARIANTS + compat fixes + FPU fix (e0716250bf) = unstable
- 2.0.0-rc7 + INVARIANTS + FPU fix = stable
- master + INVARIANTS = stable
- master = stable

(My stability test is running rm -rf /usr/obj/usr and buildkernel a few times while a zpool scrub is in progress.)

Conclusion: seems like that problem has been fixed upstream. Whatever it was. I'm not seeing anything obvious in the commit log o_0
Comment 13 Konstantin Belousov freebsd_committer freebsd_triage 2020-12-11 13:00:55 UTC
(In reply to Greg V from comment #12)
What is 'FPU fix (e0716250bf)'? I am unable to find the hash in the openzfs repo.
I am sure nothing like an 'fpu fix' was committed to FreeBSD.
Comment 14 Val Packett 2020-12-11 15:50:35 UTC
(In reply to Konstantin Belousov from comment #13)

https://github.com/openzfs/zfs/commit/e0716250bf34159082d96f04200063e3edb114b6

It was committed yesterday. Apparently authored by you :)