Bug 249149 - panic: -CURRENT with KCSAN option -> panic: mtx_lock_spin: recursed on non-recursive mutex et_hw_mtx @ src/sys/kern/kern_clocksource.c:850
Summary: panic: -CURRENT with KCSAN option -> panic: mtx_lock_spin: recursed on non-re...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: CURRENT
Hardware: amd64 Any
: --- Affects Some People
Assignee: Jason A. Harmening
URL:
Keywords: panic
Depends on:
Blocks:
 
Reported: 2020-09-06 14:31 UTC by Gordon Bergling
Modified: 2020-09-13 19:56 UTC (History)
2 users (show)

See Also:


Attachments
-CURRENT dmesg with KCSAN messages and the kernel panic (39.17 KB, text/plain)
2020-09-06 14:31 UTC, Gordon Bergling
no flags Details
preliminary patch (1.76 KB, patch)
2020-09-07 06:27 UTC, Jason A. Harmening
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Gordon Bergling freebsd_committer 2020-09-06 14:31:55 UTC
Created attachment 217791 [details]
-CURRENT dmesg with KCSAN messages and the kernel panic

On a virtualized amd64 VM (Hyper-V) of -CURRENT r365373 the system panics reproducible with a KCSAN enabled kernel.

The KERNCONF is the following:
include GENERIC
ident GENERIC-SANTI
options		KCSAN
options         RATELIMIT
options         TCPHPTS
options     	KERN_TLS

The full panic message with the stacktrace is the following
panic: mtx_lock_spin: recursed on non-recursive mutex et_hw_mtx @ /boiler/nfs/src/sys/kern/kern_clocksource.c:850

cpuid = 0
time = 1599396342
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x34/frame 0xfffffe0006e74030
kdb_backtrace() at kdb_backtrace+0x6a/frame 0xfffffe0006e740e0
vpanic() at vpanic+0x229/frame 0xfffffe0006e74130
panic() at panic+0x53/frame 0xfffffe0006e74190
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie/frame 0xfffffe0006e741f0
cpu_new_callout() at cpu_new_callout+0xc5/frame 0xfffffe0006e74240
callout_cc_add() at callout_cc_add+0x2aa/frame 0xfffffe0006e742a0
callout_reset_sbt_on() at callout_reset_sbt_on+0x5ce/frame 0xfffffe0006e74340
callout_schedule() at callout_schedule+0x76/frame 0xfffffe0006e74380
vtterm_post_input() at vtterm_post_input+0xa6/frame 0xfffffe0006e743a0
termteken_post_input() at termteken_post_input+0x2e/frame 0xfffffe0006e743c0
teken_input() at teken_input+0x1f8/frame 0xfffffe0006e74420
termcn_cnputc() at termcn_cnputc+0xa7/frame 0xfffffe0006e74460
cnputc() at cnputc+0x80/frame 0xfffffe0006e744a0
cnputs() at cnputs+0x90/frame 0xfffffe0006e744d0
putchar() at putchar+0x236/frame 0xfffffe0006e74580
kvprintf() at kvprintf+0xfa7/frame 0xfffffe0006e746d0
vprintf() at vprintf+0x9b/frame 0xfffffe0006e747b0
printf() at printf+0x53/frame 0xfffffe0006e74810
kcsan_access() at kcsan_access+0x19b/frame 0xfffffe0006e748f0
lapic_et_start() at lapic_et_start+0x628/frame 0xfffffe0006e74940
et_start() at et_start+0xdd/frame 0xfffffe0006e74980
loadtimer() at loadtimer+0x1ef/frame 0xfffffe0006e749d0
handleevents() at handleevents+0x401/frame 0xfffffe0006e74a30
cpu_activeclock() at cpu_activeclock+0xd0/frame 0xfffffe0006e74a70
cpu_idle() at cpu_idle+0xea/frame 0xfffffe0006e74aa0
sched_idletd() at sched_idletd+0x5de/frame 0xfffffe0006e74bb0
fork_exit() at fork_exit+0xaa/frame 0xfffffe0006e74bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0006e74bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
Comment 1 Jason A. Harmening freebsd_committer 2020-09-07 06:27:19 UTC
Created attachment 217796 [details]
preliminary patch
Comment 2 Jason A. Harmening freebsd_committer 2020-09-07 06:34:33 UTC
It looks like this is a KCSan false positive on the LAPIC MMIO mapping.  Since this is a guest that doesn't use the x2APIC (and therefore must use MMIO for LAPIC access instead of MSRs), all CPUs will share the same LAPIC mapping.  However, the underlying physical access will be routed to the local CPU's APIC, so multiple CPUs aren't actually racing on the same hardware resource.   Or at least I hope that's the source of the problem, otherwise Hyper-V really misconfigured the guest.

Since the LAPIC access happens in the course of configuring an event timer, and the KCSan warning printf itself tries to configure a callout, we get the panic.

Can you try the attached (build-tested only) patch?
Comment 3 Gordon Bergling freebsd_committer 2020-09-07 15:18:41 UTC
(In reply to Jason A. Harmening from comment #2)

I won't think, that the Hyper-V virtualization is the problem here. I applied your patch to my working tree and I haven't seen a panic, yet. :)

Even it's just for an uptime from about 20 minutes, but without the patch the machine was panicking from within 2-3 minutes.
Comment 4 Gordon Bergling freebsd_committer 2020-09-11 16:22:02 UTC
I have run my -CURRENT test system for a couple of days with the patch and haven't seen the kernel panic. This patch also solves sporadic system hangs when KCSAN is enabled.
Comment 5 commit-hook freebsd_committer 2020-09-12 07:04:15 UTC
A commit references this bug:

Author: jah
Date: Sat Sep 12 07:04:01 UTC 2020
New revision: 365652
URL: https://svnweb.freebsd.org/changeset/base/365652

Log:
  amd64: prevent KCSan false positives on LAPIC mapping

  For configurations without x2APIC support (guests, older hardware), the global
  LAPIC MMIO mapping will trigger false-positive KCSan reports as it will appear
  that multiple CPUs are concurrently reading and writing the same address.
  This isn't actually true, as the underlying physical access will be performed
  on the local CPU's APIC. Additionally, because LAPIC access can happen during
  event timer configuration, the resulting KCSan printf can produce a panic due
  to attempted recursion on event timer resources.

  Add a __nosanitizethread preprocessor define to prevent the compiler from
  inserting TSan hooks, and apply it to the x86 LAPIC accessors.

  PR:		249149
  Reported by:	gbe
  Reviewed by:	andrew, kib
  Tested by:	gbe
  Differential Revision:	https://reviews.freebsd.org/D26354

Changes:
  head/sys/sys/cdefs.h
  head/sys/x86/x86/local_apic.c