Created attachment 205949 [details] coredump_spin_lock_held_too_long Hello, I had been debugging this for a while and was 95% sure my panic was being caused by a graphics thing, turns out I was wrong and it was caused by net/wireguard. The relevant packages and versions: wireguard-go-0.0.20190517 wireguard-0.0.20190702 It may also be relevant to mention that I use WireGuard to have IPv6-connectivity on the go, it only uses IPv6-addresses except for the IPv4 of the peer. I can reproduce the issue by starting 20 parallel downloads of big files (8G), e.g.: for i in `seq 1 20`; do curl -6 ${BIG_FILE} > /dev/null &; done (The -6 param is because of the use I give to WireGuard as on-the-go IPv6) CPU usage rises (I think that's to be expected with WireGuard) and after some seconds the panic occurs (see core dump). It *may* also be relevant that this is running on a ThinkPad A485, which is affected by #231760. That means that ATM FreeBSD can only run with hw.pci.mcfg=0 in /boot/loader.conf.
I should have also mentioned that this does not occur with WireGuard disabled.
Assign to net@ since this is a kernel panic and seems to be only triggered by wireguard.
Created attachment 206031 [details] coredump_nowg_onboot_net_up Actually, I meant to update this bug report yesterday but didn't manage to. I have seen this kernel panic again without WireGuard, though it does happen way more often with it and I can produce the panic with it but can't reliably (ehem) panic the system without it. It actually happened once while booting, but after network was initialised, so network could still be at fault. Now I'm quite inclined to believe that this is related to #231760 and booting with hw.pci.mcfg=0 (the only way to boot atm); but my knowledge of FreeBSD's inners is not up to the task to fully pin-point it. I have two more dumps, please note that even if drm-kmod is present and somewhat verbose, it was not present on the system at the time of the first dump; so it'd be misleading to think it's caused by it.
Created attachment 206032 [details] coredump_nowg_onx_drm-kmod This happened without WireGuard in an X session.
Created attachment 206036 [details] proposed patch This is a bit of a shot in the dark, but are you able to reproduce the problem when the attached patch is applied to your kernel? The bug it fixes will cause a spinlock leak, so interrupts are left disabled in a running thread. That can trigger the "spinlock held too long" panic.
Created attachment 206043 [details] coredump_current_patch206036 Thank you Mark, just tested both HEAD of 13-CURRENT and with your shot-in-the-dark patch, both cases resulted in a panic after running my "trustworthy" WireGuard panic-maker :-p. Attached is a panic of the kernel with your patch, forgot to dump the one for HEAD but I doubt it's more interesting (can produce it if requested).
(In reply to Evilham from comment #6) Ok. Could you please run the following command and attach the output? # echo "thread apply all bt" | kgdb83 /boot/kernel/kernel /var/crash/vmcore.2
Created attachment 206045 [details] kgdb83_thread_applyallbt_patch206036.gz Here you go! (Had to gz compress because it's about 1.4M)
Created attachment 206046 [details] kgdb script Thanks. Now, please try downloading the attached file, and run: # printf "source /path/to/acttace.py\nacttrace\n" | kgdb83 /boot/kernel/kernel /var/crash/vmcore.2 Then save the output here. The kernel panicked because one of the cores failed to respond to an interrupt, and it's not clear to me what that core was doing at the time. It would also be helpful to see output from: # kgdb /boot/kernel/kernel /var/crash/vmcore.2 (kgdb) tid 100168 (kgdb) frame 4 (kgdb) p other_cpus
Created attachment 206050 [details] acttrace_patch_206036 I'm starting to be really glad I opened this bug instead of trying to debug it myself.
Created attachment 206051 [details] tid_frame_patch206038 # echo -e "tid 100168\nframe 4\np other_cpus\n" | kgdb /boot/kernel/kernel vmcore.2 > tid_frame_patch20603
Created attachment 206056 [details] panic_assertion_patch206036.tar.gz This panic occurred with 13.0-CURRENT r350275+d78b126a34dc(master) + patch206036 over night; the system was compiling some ports in poudriere for CURRENT. I was about to discard it as it being the same as the others, but noticed it has a different message, namely panic: Assertion error_ == 0 failed at /freebsd/src/sys/vm/vm_map.c:553 Attached is the dump, backtrace, acttrace and I managed to identify tid and frame but p other_cpus returned "No symbol "other_cpus" in current context.", which is probably normal since this is a totally different thing, but I thought I'd give it a try anyway. I hope this is useful.
These problems don't appear to be network-related. I'm not sure how the page daemon got stuck waiting for its TLB shootdown to complete. clang unfortunately optimizes away the variable which would show which CPU(s) we were waiting for. For the VOP_UNSET_TEXT panic, please try applying the attached patch; it should give more debugging info when the problem happens again. I think you can drop the patch I supplied in comment 5. Please also try applying the patch in https://reviews.freebsd.org/D20327 . It should allow you to boot without having to set hw.pci.mcfg=0. Finally, are you willing to share the vmcores with me so that I can take a closer look?
Created attachment 206061 [details] VOP_UNSET_TEXT debug patch Forgot to attach the patch in the previous comment.
(In reply to Mark Johnston from comment #13) You usually can conclude which cpu was blocked by looking at smp_tlb_generation and pcpu pc_smp_tlb_done. I even added printint of pc_smp_tlb_done to show pcpu to make this easier. Also, what is the UNSET_TEXT panic ? Can you show me the backtrace ? Which system version ?
(In reply to Konstantin Belousov from comment #15) See comment 12: it was on r350275+r350310. The stack was: panic: Assertion error_ == 0 failed at /freebsd/src/sys/vm/vm_map.c:553 cpuid = 4 time = 1564016615 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00a9081780 vpanic() at vpanic+0x19d/frame 0xfffffe00a90817d0 panic() at panic+0x43/frame 0xfffffe00a9081830 vm_map_entry_set_vnode_text() at vm_map_entry_set_vnode_text+0x1de/frame 0xfffffe00a9081870 vm_map_process_deferred() at vm_map_process_deferred+0x70/frame 0xfffffe00a9081890 vm_map_remove() at vm_map_remove+0xc6/frame 0xfffffe00a90818c0 vmspace_exit() at vmspace_exit+0xd3/frame 0xfffffe00a9081900 exit1() at exit1+0x5ad/frame 0xfffffe00a9081970 sys_sys_exit() at sys_sys_exit+0xd/frame 0xfffffe00a9081980 amd64_syscall() at amd64_syscall+0x2bb/frame 0xfffffe00a9081ab0 fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe00a9081ab0 --- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x80076092a, rsp = 0x7fffffffcee8, rbp = 0x7fffffffcf00 ---
Created attachment 206063 [details] D20327_VOP_UNSET_TEXT_spin_lock_held_too_long Hello, recapitulating: now I'm running r350327 + D20327 + the VOP_UNSET_TEXT patch in comment 14. (r350327 already includes the patch in comment 5 that Mark had proposed earlier). Commented out the relevant line in /boot/loader.conf and after booting system reports: # sysctl hw.pci.mcfg hw.pci.mcfg: 1 With this setup I can still use my "WireGuard trick" (50-100 parallel downloads of big files) to produce the panic, it looks like it's not *just* a duplicate of #231760, at least nothing that would also be solved with D20327. Notice that compiling world and kernel with all CPUs does not trigger it, still haven't found a way to make it happen without WireGuard, but it has happened at some point without it during daily work. Attached are the public bits of the dump, I'm sending Mark this coredump privately. I'll be using this setup without WireGuard daily and see if the problem triggers during normal usage, with all the kernel compiling these days I haven't kept it up long enough to test that.
(In reply to Mark Johnston from comment #16) So this is amd ? Can you recheck what is the value of sysctl hw.lower_amd64_sharedpage ?
(In reply to Konstantin Belousov from comment #18) Indeed AMD. The requested sysctl: # sysctl hw.lower_amd64_sharedpage hw.lower_amd64_sharedpage: 1 System data from boot log: CPU: AMD Ryzen 7 PRO 2700U w/ Radeon Vega Mobile Gfx (2195.94-MHz K8-class CPU) Origin="AuthenticAMD" Id=0x810f10 Family=0x17 Model=0x11 Stepping=0 Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT> Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND> AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM> AMD Features2=0x35c233ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,SKINIT,WDT,TCE,Topology,PCXC,PNXC,DBE,PL2I,MWAITX> Structured Extended Features=0x209c01a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA> XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES> AMD Extended Feature Extensions ID EBX=0x1007<CLZERO,IRPerf,XSaveErPtr,IBPB> SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=32768 TSC: P-state invariant, performance statistics real memory = 34359738368 (32768 MB) avail memory = 32123269120 (30635 MB) Event timer "LAPIC" quality 600 ACPI APIC Table: <LENOVO TP-R0W > FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs FreeBSD/SMP: 1 package(s) x 4 core(s) x 2 hardware threads
(In reply to Mark Johnston from comment #13) Just in case the extra information might prove to be of use: I had a one-time 'spin lock held too long' back on 2019-Jul-06 that I reported in: https://lists.freebsd.org/pipermail/freebsd-current/2019-July/073779.html It was under Hyper-V and had nothing to do with wireguard or anything like that: buildworld buildkernel under a -r349794 based build. This was on a 1950X Threadripper. But it did not repeat in my activity after that, despite trying a bunch of (re)builds. I've since updated beyond that point). The message submittal had a backtrace.
(In reply to Evilham from comment #19) Make sure that you run absolutely latest BIOS. AMD does not publish errata documents for its mobile cpus, so I do not know if any of existing Ryzen erratas are applicable to your machine. We specifically allow some known workarounds for desktop chips, and rumors say that it is not needed anymore with BIOS updates. Try the same for mobile.
(In reply to Konstantin Belousov from comment #21) > Make sure that you run absolutely latest BIOS. Oh, wth.... That didn't cross my mind. I had 1.16, which was latest last time I checked (end of June), now there is 1.22 and 1.24. Upgraded to 1.24, suddenly there are issues with drm-kmod, booted single-user, disabled it, system boots fine, ran my "WireGuard" panic-maker and... it's not working anymore (aka no panic). These are the release notes, does any of that sound like it could have fixed the issue? https://download.lenovo.com/pccbbs/mobiles/r0wuj56w.txt This sounds super suspicious to me but I can't really parse the meaning out of this English: - [Important] Modify WMI AmdVt item can't modify and remove unuse item I'll reactivate WireGuard on my day-to-day and keep an eye on things, if in a few days the panic doesn't happen again, I'll close this bug.
(In reply to Evilham from comment #22) Lenovo changelog is useless, and the real changes that affect CPU bugs come from AMD AGESA blobs. The later returns us to the issue that AMD refuses to provide any CPU documentation, even under NDA.
The issue really just disappeared by upgrading the BIOS to 1.24. I wrote everything I remembered about getting the Lenovo ThinkPad A485 to work with FreeBSD reliably here: https://evilham.com/en/blog/2019-ThinkPad-A485-FreeBSD/ If anyone with an account wants to add the information FreeBSD's Wiki, please do so, it's where it belongs, there is no need for crediting or linking to that site, since it's mostly for my own future reference and in the off-chance that someone with similar issues finds it. Thanks a lot to everyone who helped me debug this, I actually learnt a lot and will hopefully be able to use it in the future.