Bug 239351 - panic: spin lock held too long under heavy load (net/wireguard affected, SOLVED: Upgrading BIOS to 1.24)
Summary: panic: spin lock held too long under heavy load (net/wireguard affected, SOLV...
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 12.0-RELEASE
Hardware: amd64 Any
: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2019-07-20 20:31 UTC by Evilham
Modified: 2020-01-21 01:32 UTC (History)
6 users (show)

See Also:


Attachments
coredump_spin_lock_held_too_long (122.95 KB, text/plain)
2019-07-20 20:31 UTC, Evilham
no flags Details
coredump_nowg_onboot_net_up (132.04 KB, text/plain)
2019-07-24 10:12 UTC, Evilham
no flags Details
coredump_nowg_onx_drm-kmod (230.47 KB, text/plain)
2019-07-24 10:13 UTC, Evilham
no flags Details
proposed patch (3.81 KB, patch)
2019-07-24 13:46 UTC, Mark Johnston
no flags Details | Diff
coredump_current_patch206036 (314.07 KB, text/plain)
2019-07-24 21:10 UTC, Evilham
no flags Details
kgdb83_thread_applyallbt_patch206036.gz (71.26 KB, application/gzip)
2019-07-24 23:12 UTC, Evilham
no flags Details
kgdb script (1.36 KB, text/plain)
2019-07-25 01:08 UTC, Mark Johnston
no flags Details
acttrace_patch_206036 (21.48 KB, text/plain)
2019-07-25 07:59 UTC, Evilham
no flags Details
tid_frame_patch206038 (4.97 KB, text/plain)
2019-07-25 08:01 UTC, Evilham
no flags Details
panic_assertion_patch206036.tar.gz (79.77 KB, application/gzip)
2019-07-25 09:14 UTC, Evilham
no flags Details
VOP_UNSET_TEXT debug patch (1.17 KB, patch)
2019-07-25 16:03 UTC, Mark Johnston
no flags Details | Diff
D20327_VOP_UNSET_TEXT_spin_lock_held_too_long (86.23 KB, application/gzip)
2019-07-25 18:51 UTC, Evilham
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Evilham 2019-07-20 20:31:08 UTC
Created attachment 205949 [details]
coredump_spin_lock_held_too_long

Hello, I had been debugging this for a while and was 95% sure my panic was being caused by a graphics thing, turns out I was wrong and it was caused by net/wireguard.

The relevant packages and versions:

	wireguard-go-0.0.20190517
	wireguard-0.0.20190702


It may also be relevant to mention that I use WireGuard to have IPv6-connectivity on the go, it only uses IPv6-addresses except for the IPv4 of the peer.

I can reproduce the issue by starting 20 parallel downloads of big files (8G), e.g.:

	for i in `seq 1 20`; do curl -6 ${BIG_FILE} > /dev/null &; done

(The -6 param is because of the use I give to WireGuard as on-the-go IPv6)

CPU usage rises (I think that's to be expected with WireGuard) and after some seconds the panic occurs (see core dump).


It *may* also be relevant that this is running on a ThinkPad A485, which is affected by #231760.
That means that ATM FreeBSD can only run with hw.pci.mcfg=0 in /boot/loader.conf.
Comment 1 Evilham 2019-07-20 20:36:24 UTC
I should have also mentioned that this does not occur with WireGuard disabled.
Comment 2 Bernhard Froehlich freebsd_committer freebsd_triage 2019-07-24 08:49:54 UTC
Assign to net@ since this is a kernel panic and seems to be only triggered by wireguard.
Comment 3 Evilham 2019-07-24 10:12:00 UTC
Created attachment 206031 [details]
coredump_nowg_onboot_net_up

Actually, I meant to update this bug report yesterday but didn't manage to.

I have seen this kernel panic again without WireGuard, though it does happen way more often with it and I can produce the panic with it but can't reliably (ehem) panic the system without it.

It actually happened once while booting, but after network was initialised, so network could still be at fault.

Now I'm quite inclined to believe that this is related to #231760 and booting with hw.pci.mcfg=0 (the only way to boot atm); but my knowledge of FreeBSD's inners is not up to the task to fully pin-point it.

I have two more dumps, please note that even if drm-kmod is present and somewhat verbose, it was not present on the system at the time of the first dump; so it'd be misleading to think it's caused by it.
Comment 4 Evilham 2019-07-24 10:13:16 UTC
Created attachment 206032 [details]
coredump_nowg_onx_drm-kmod

This happened without WireGuard in an X session.
Comment 5 Mark Johnston freebsd_committer freebsd_triage 2019-07-24 13:46:49 UTC
Created attachment 206036 [details]
proposed patch

This is a bit of a shot in the dark, but are you able to reproduce the problem when the attached patch is applied to your kernel?  The bug it fixes will cause a spinlock leak, so interrupts are left disabled in a running thread.  That can trigger the "spinlock held too long" panic.
Comment 6 Evilham 2019-07-24 21:10:14 UTC
Created attachment 206043 [details]
coredump_current_patch206036

Thank you Mark, just tested both HEAD of 13-CURRENT and with your shot-in-the-dark patch, both cases resulted in a panic after running my "trustworthy" WireGuard panic-maker :-p.

Attached is a panic of the kernel with your patch, forgot to dump the one for HEAD but I doubt it's more interesting (can produce it if requested).
Comment 7 Mark Johnston freebsd_committer freebsd_triage 2019-07-24 22:54:50 UTC
(In reply to Evilham from comment #6)
Ok.  Could you please run the following command and attach the output?

# echo "thread apply all bt" | kgdb83 /boot/kernel/kernel /var/crash/vmcore.2
Comment 8 Evilham 2019-07-24 23:12:39 UTC
Created attachment 206045 [details]
kgdb83_thread_applyallbt_patch206036.gz

Here you go!
(Had to gz compress because it's about 1.4M)
Comment 9 Mark Johnston freebsd_committer freebsd_triage 2019-07-25 01:08:05 UTC
Created attachment 206046 [details]
kgdb script

Thanks.  Now, please try downloading the attached file, and run:

# printf "source /path/to/acttace.py\nacttrace\n" | kgdb83 /boot/kernel/kernel /var/crash/vmcore.2

Then save the output here.  The kernel panicked because one of the cores failed to respond to an interrupt, and it's not clear to me what that core was doing at the time.

It would also be helpful to see output from:

# kgdb /boot/kernel/kernel /var/crash/vmcore.2
(kgdb) tid 100168
(kgdb) frame 4
(kgdb) p other_cpus
Comment 10 Evilham 2019-07-25 07:59:48 UTC
Created attachment 206050 [details]
acttrace_patch_206036

I'm starting to be really glad I opened this bug instead of trying to debug it myself.
Comment 11 Evilham 2019-07-25 08:01:04 UTC
Created attachment 206051 [details]
tid_frame_patch206038

# echo -e "tid 100168\nframe 4\np other_cpus\n" |  kgdb /boot/kernel/kernel vmcore.2 > tid_frame_patch20603
Comment 12 Evilham 2019-07-25 09:14:21 UTC
Created attachment 206056 [details]
panic_assertion_patch206036.tar.gz

This panic occurred with 13.0-CURRENT r350275+d78b126a34dc(master) + patch206036  over night; the system was compiling some ports in poudriere for CURRENT.

I was about to discard it as it being the same as the others, but noticed it has a different message, namely
panic: Assertion error_ == 0 failed at /freebsd/src/sys/vm/vm_map.c:553

Attached is the dump, backtrace, acttrace and I managed to identify tid and frame but p other_cpus returned "No symbol "other_cpus" in current context.", which is probably normal since this is a totally different thing, but I thought I'd give it a try anyway.

I hope this is useful.
Comment 13 Mark Johnston freebsd_committer freebsd_triage 2019-07-25 16:02:16 UTC
These problems don't appear to be network-related.

I'm not sure how the page daemon got stuck waiting for its TLB shootdown to complete.  clang unfortunately optimizes away the variable which would show which CPU(s) we were waiting for.

For the VOP_UNSET_TEXT panic, please try applying the attached patch; it should give more debugging info when the problem happens again.  I think you can drop the patch I supplied in comment 5.

Please also try applying the patch in https://reviews.freebsd.org/D20327 .  It should allow you to boot without having to set hw.pci.mcfg=0.

Finally, are you willing to share the vmcores with me so that I can take a closer look?
Comment 14 Mark Johnston freebsd_committer freebsd_triage 2019-07-25 16:03:07 UTC
Created attachment 206061 [details]
VOP_UNSET_TEXT debug patch

Forgot to attach the patch in the previous comment.
Comment 15 Konstantin Belousov freebsd_committer freebsd_triage 2019-07-25 16:37:58 UTC
(In reply to Mark Johnston from comment #13)
You usually can conclude which cpu was blocked by looking at smp_tlb_generation and pcpu pc_smp_tlb_done.  I even added printint of pc_smp_tlb_done to show pcpu to make this easier.

Also, what is the UNSET_TEXT panic ?  Can you show me the backtrace ?  Which system version ?
Comment 16 Mark Johnston freebsd_committer freebsd_triage 2019-07-25 17:00:35 UTC
(In reply to Konstantin Belousov from comment #15)
See comment 12: it was on r350275+r350310.  The stack was:

panic: Assertion error_ == 0 failed at /freebsd/src/sys/vm/vm_map.c:553
cpuid = 4
time = 1564016615
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00a9081780
vpanic() at vpanic+0x19d/frame 0xfffffe00a90817d0
panic() at panic+0x43/frame 0xfffffe00a9081830
vm_map_entry_set_vnode_text() at vm_map_entry_set_vnode_text+0x1de/frame 0xfffffe00a9081870
vm_map_process_deferred() at vm_map_process_deferred+0x70/frame 0xfffffe00a9081890
vm_map_remove() at vm_map_remove+0xc6/frame 0xfffffe00a90818c0
vmspace_exit() at vmspace_exit+0xd3/frame 0xfffffe00a9081900
exit1() at exit1+0x5ad/frame 0xfffffe00a9081970
sys_sys_exit() at sys_sys_exit+0xd/frame 0xfffffe00a9081980
amd64_syscall() at amd64_syscall+0x2bb/frame 0xfffffe00a9081ab0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe00a9081ab0
--- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x80076092a, rsp = 0x7fffffffcee8, rbp = 0x7fffffffcf00 ---
Comment 17 Evilham 2019-07-25 18:51:41 UTC
Created attachment 206063 [details]
D20327_VOP_UNSET_TEXT_spin_lock_held_too_long

Hello, recapitulating:

now I'm running r350327 + D20327 + the VOP_UNSET_TEXT patch in comment 14. (r350327 already includes the patch in comment 5 that Mark had proposed earlier).

Commented out the relevant line in /boot/loader.conf and after booting system reports:
# sysctl hw.pci.mcfg
hw.pci.mcfg: 1

With this setup I can still use my "WireGuard trick" (50-100 parallel downloads of big files) to produce the panic, it looks like it's not *just* a duplicate of #231760, at least nothing that would also be solved with D20327.

Notice that compiling world and kernel with all CPUs does not trigger it, still haven't found a way to make it happen without WireGuard, but it has happened at some point without it during daily work.

Attached are the public bits of the dump, I'm sending Mark this coredump privately.

I'll be using this setup without WireGuard daily and see if the problem triggers during normal usage, with all the kernel compiling these days I haven't kept it up long enough to test that.
Comment 18 Konstantin Belousov freebsd_committer freebsd_triage 2019-07-25 21:00:01 UTC
(In reply to Mark Johnston from comment #16)
So this is amd ?  Can you recheck what is the value of sysctl
hw.lower_amd64_sharedpage ?
Comment 19 Evilham 2019-07-25 21:31:16 UTC
(In reply to Konstantin Belousov from comment #18)

Indeed AMD.

The requested sysctl:

# sysctl hw.lower_amd64_sharedpage
hw.lower_amd64_sharedpage: 1


System data from boot log:

CPU: AMD Ryzen 7 PRO 2700U w/ Radeon Vega Mobile Gfx (2195.94-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x810f10  Family=0x17  Model=0x11  Stepping=0
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
  AMD Features2=0x35c233ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,SKINIT,WDT,TCE,Topology,PCXC,PNXC,DBE,PL2I,MWAITX>
  Structured Extended Features=0x209c01a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  AMD Extended Feature Extensions ID EBX=0x1007<CLZERO,IRPerf,XSaveErPtr,IBPB>
  SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=32768
  TSC: P-state invariant, performance statistics
real memory  = 34359738368 (32768 MB)
avail memory = 32123269120 (30635 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <LENOVO TP-R0W  >
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s) x 2 hardware threads
Comment 20 Mark Millard 2019-07-25 21:44:47 UTC
(In reply to Mark Johnston from comment #13)

Just in case the extra information might prove
to be of use: I had a one-time 'spin lock held
too long' back on 2019-Jul-06 that I reported
in:

https://lists.freebsd.org/pipermail/freebsd-current/2019-July/073779.html

It was under Hyper-V and had nothing to do with wireguard
or anything like that: buildworld buildkernel under a
-r349794 based build. This was on a 1950X Threadripper.
But it did not repeat in my activity after that, despite
trying a bunch of (re)builds. I've since updated beyond
that point).

The message submittal had a backtrace.
Comment 21 Konstantin Belousov freebsd_committer freebsd_triage 2019-07-25 21:58:03 UTC
(In reply to Evilham from comment #19)
Make sure that you run absolutely latest BIOS.

AMD does not publish errata documents for its mobile cpus, so I do not know
if any of existing Ryzen erratas are applicable to your machine.  We specifically
allow some known workarounds for desktop chips, and rumors say that it is not
needed anymore with BIOS updates.  Try the same for mobile.
Comment 22 Evilham 2019-07-25 23:17:30 UTC
(In reply to Konstantin Belousov from comment #21)

> Make sure that you run absolutely latest BIOS.

Oh, wth.... That didn't cross my mind. I had 1.16, which was latest last time I checked (end of June), now there is 1.22 and 1.24.

Upgraded to 1.24, suddenly there are issues with drm-kmod, booted single-user, disabled it, system boots fine, ran my "WireGuard" panic-maker and... it's not working anymore (aka no panic).

These are the release notes, does any of that sound like it could have fixed the issue?
https://download.lenovo.com/pccbbs/mobiles/r0wuj56w.txt

This sounds super suspicious to me but I can't really parse the meaning out of this English:

- [Important] Modify WMI AmdVt item can't modify and remove unuse item


I'll reactivate WireGuard on my day-to-day and keep an eye on things, if in a few days the panic doesn't happen again, I'll close this bug.
Comment 23 Konstantin Belousov freebsd_committer freebsd_triage 2019-07-26 08:49:40 UTC
(In reply to Evilham from comment #22)
Lenovo changelog is useless, and the real changes that affect CPU bugs come from AMD AGESA blobs.  The later returns us to the issue that AMD refuses to provide any CPU documentation, even under NDA.
Comment 24 Evilham 2019-08-03 17:52:43 UTC
The issue really just disappeared by upgrading the BIOS to 1.24.

I wrote everything I remembered about getting the Lenovo ThinkPad A485 to work with FreeBSD reliably here: https://evilham.com/en/blog/2019-ThinkPad-A485-FreeBSD/

If anyone with an account wants to add the information FreeBSD's Wiki, please do so, it's where it belongs, there is no need for crediting or linking to that site, since it's mostly for my own future reference and in the off-chance that someone with similar issues finds it.

Thanks a lot to everyone who helped me debug this, I actually learnt a lot and will hopefully be able to use it in the future.