Created attachment 217791 [details] -CURRENT dmesg with KCSAN messages and the kernel panic On a virtualized amd64 VM (Hyper-V) of -CURRENT r365373 the system panics reproducible with a KCSAN enabled kernel. The KERNCONF is the following: include GENERIC ident GENERIC-SANTI options KCSAN options RATELIMIT options TCPHPTS options KERN_TLS The full panic message with the stacktrace is the following panic: mtx_lock_spin: recursed on non-recursive mutex et_hw_mtx @ /boiler/nfs/src/sys/kern/kern_clocksource.c:850 cpuid = 0 time = 1599396342 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x34/frame 0xfffffe0006e74030 kdb_backtrace() at kdb_backtrace+0x6a/frame 0xfffffe0006e740e0 vpanic() at vpanic+0x229/frame 0xfffffe0006e74130 panic() at panic+0x53/frame 0xfffffe0006e74190 _mtx_lock_spin_cookie() at _mtx_lock_spin_cookie/frame 0xfffffe0006e741f0 cpu_new_callout() at cpu_new_callout+0xc5/frame 0xfffffe0006e74240 callout_cc_add() at callout_cc_add+0x2aa/frame 0xfffffe0006e742a0 callout_reset_sbt_on() at callout_reset_sbt_on+0x5ce/frame 0xfffffe0006e74340 callout_schedule() at callout_schedule+0x76/frame 0xfffffe0006e74380 vtterm_post_input() at vtterm_post_input+0xa6/frame 0xfffffe0006e743a0 termteken_post_input() at termteken_post_input+0x2e/frame 0xfffffe0006e743c0 teken_input() at teken_input+0x1f8/frame 0xfffffe0006e74420 termcn_cnputc() at termcn_cnputc+0xa7/frame 0xfffffe0006e74460 cnputc() at cnputc+0x80/frame 0xfffffe0006e744a0 cnputs() at cnputs+0x90/frame 0xfffffe0006e744d0 putchar() at putchar+0x236/frame 0xfffffe0006e74580 kvprintf() at kvprintf+0xfa7/frame 0xfffffe0006e746d0 vprintf() at vprintf+0x9b/frame 0xfffffe0006e747b0 printf() at printf+0x53/frame 0xfffffe0006e74810 kcsan_access() at kcsan_access+0x19b/frame 0xfffffe0006e748f0 lapic_et_start() at lapic_et_start+0x628/frame 0xfffffe0006e74940 et_start() at et_start+0xdd/frame 0xfffffe0006e74980 loadtimer() at loadtimer+0x1ef/frame 0xfffffe0006e749d0 handleevents() at handleevents+0x401/frame 0xfffffe0006e74a30 cpu_activeclock() at cpu_activeclock+0xd0/frame 0xfffffe0006e74a70 cpu_idle() at cpu_idle+0xea/frame 0xfffffe0006e74aa0 sched_idletd() at sched_idletd+0x5de/frame 0xfffffe0006e74bb0 fork_exit() at fork_exit+0xaa/frame 0xfffffe0006e74bf0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0006e74bf0 --- trap 0, rip = 0, rsp = 0, rbp = 0 --- KDB: enter: panic
Created attachment 217796 [details] preliminary patch
It looks like this is a KCSan false positive on the LAPIC MMIO mapping. Since this is a guest that doesn't use the x2APIC (and therefore must use MMIO for LAPIC access instead of MSRs), all CPUs will share the same LAPIC mapping. However, the underlying physical access will be routed to the local CPU's APIC, so multiple CPUs aren't actually racing on the same hardware resource. Or at least I hope that's the source of the problem, otherwise Hyper-V really misconfigured the guest. Since the LAPIC access happens in the course of configuring an event timer, and the KCSan warning printf itself tries to configure a callout, we get the panic. Can you try the attached (build-tested only) patch?
(In reply to Jason A. Harmening from comment #2) I won't think, that the Hyper-V virtualization is the problem here. I applied your patch to my working tree and I haven't seen a panic, yet. :) Even it's just for an uptime from about 20 minutes, but without the patch the machine was panicking from within 2-3 minutes.
I have run my -CURRENT test system for a couple of days with the patch and haven't seen the kernel panic. This patch also solves sporadic system hangs when KCSAN is enabled.
A commit references this bug: Author: jah Date: Sat Sep 12 07:04:01 UTC 2020 New revision: 365652 URL: https://svnweb.freebsd.org/changeset/base/365652 Log: amd64: prevent KCSan false positives on LAPIC mapping For configurations without x2APIC support (guests, older hardware), the global LAPIC MMIO mapping will trigger false-positive KCSan reports as it will appear that multiple CPUs are concurrently reading and writing the same address. This isn't actually true, as the underlying physical access will be performed on the local CPU's APIC. Additionally, because LAPIC access can happen during event timer configuration, the resulting KCSan printf can produce a panic due to attempted recursion on event timer resources. Add a __nosanitizethread preprocessor define to prevent the compiler from inserting TSan hooks, and apply it to the x86 LAPIC accessors. PR: 249149 Reported by: gbe Reviewed by: andrew, kib Tested by: gbe Differential Revision: https://reviews.freebsd.org/D26354 Changes: head/sys/sys/cdefs.h head/sys/x86/x86/local_apic.c