Bug 244878

Summary: Kernel panic due to spin lock held too long
Product: Base System Reporter: Adam Jimerson <vendion>
Component: kernAssignee: freebsd-bugs (Nobody) <bugs>
Status: Open ---    
Severity: Affects Only Me CC: douglas, emaste, jah, markj
Priority: --- Keywords: crash, needs-qa
Version: 12.1-RELEASE   
Hardware: amd64   
OS: Any   

Description Adam Jimerson 2020-03-18 01:05:55 UTC
Running FreeBSD 12.1-p2 with the GENERIC kernel my system encountered a kernel panic with a "spin lock held too long" error. My system was idle when this happened so I can't say that it was due to any specific thing. I do have a copy of the generated crash dumps though which I have uploaded. Other than the crash dumps I don't know what other information to provide, but will be happy to provide what ever information I can.

The following files:

[/u/h/vendion]─> ls -lh /var/crash
total 723608
-rw-r--r--  1 root  wheel     2B Mar 17 20:39 bounds
-rw-r--r--  1 root  wheel   287K Mar 17 20:39 core.txt.0
-rw-------  1 root  wheel   383B Mar 17 20:39 info.0
lrwxr-xr-x  1 root  wheel     6B Mar 17 20:39 info.last@ -> info.0
-rw-r--r--  1 root  wheel     5B Nov  5 20:16 minfree
-rw-------  1 root  wheel   1.8G Mar 17 20:39 vmcore.0
lrwxr-xr-x  1 root  wheel     8B Mar 17 20:39 vmcore.last@ -> vmcore.0

I tared up and can be downloaded at https://drive.google.com/file/d/1o3uZKfPbiU-b2xJAUkNKHCzuU3qB9sH9/view?usp=sharing
Comment 1 Kubilay Kocak freebsd_committer freebsd_triage 2020-03-18 01:11:56 UTC
Thank you for the report Adam

Is there a panic summary and backtrace provided on the console at the time of the crash?
Comment 2 Adam Jimerson 2020-03-18 01:20:47 UTC
The best I can find is the following from /var/log/messages

Limiting closed port RST response from 246 to 200 packets/sec
spin lock 0xffffffff8200cc50 (smp rendezvous) held by 0xfffff8024555e000 (tid 101747) too long
timeout stopping cpus
panic: spin lock held too long
cpuid = 0
time = 1584487103
KDB: stack backtrace:
#0 0xffffffff80c1d2b7 at kdb_backtrace+0x67
#1 0xffffffff80bd05ed at vpanic+0x19d
#2 0xffffffff80bd0443 at panic+0x43
#3 0xffffffff80bb09ad at _mtx_lock_indefinite_check+0x6d
#4 0xffffffff80bb04f5 at _mtx_lock_spin_cookie+0xd5
#5 0xffffffff80c2d539 at smp_rendezvous_cpus+0x239
#6 0xffffffff80bcb071 at _rm_wlock+0x171
#7 0xffffffff80cf8d48 at rt_foreach_fib_walk+0xa8
#8 0xffffffff80e21a37 at in6_mtutimo+0x47
#9 0xffffffff80bea743 at softclock_call_cc+0x143
#10 0xffffffff80beac09 at softclock+0x79
#11 0xffffffff80b93dd4 at ithread_loop+0x1d4
#12 0xffffffff80b90c43 at fork_exit+0x83
#13 0xffffffff81082a6e at fork_trampoline+0xe
Uptime: 7h42m4s

Everything before that looks normal to me.
Comment 3 Jason A. Harmening freebsd_committer freebsd_triage 2020-03-24 23:15:12 UTC
Not a lot to go on from the backtrace.   One thing that can cause this panic is if a thread calls smp_rendezvous_cpus() with local interrupts disabled: if another thread holds smp_ipi_mtx and is trying to IPI the first thread's CPU, the two threads can livelock.

It doesn't look like the timer callout+in6 route expiration path that panicked should ever call smp_rendezvous_cpus() with interrupts disabled, but I think it's also possible that this thread could have been the victim of two other threads that were livelocked.

r357308 added a check to HEAD (INVARIANTS kernels only) that panics if a thread enters smp_rendezvous_cpus() with interrupts disabled.  If this is at all reproducible, it might be worth trying a kernel with that check backported.

Also worth noting: in6_mtutimo() was completely removed from HEAD in 355888.
Comment 4 Mark Johnston freebsd_committer freebsd_triage 2020-06-15 21:39:46 UTC
Is the kernel (12.1-p2) still available?  Could you post it, so the core can be inspected?
Comment 5 Adam Jimerson 2020-06-16 01:26:58 UTC
I don't have a 12.1-p2 kernel anymore, my current kernel is 12.1-p6 and I have not had any spin locks since reporting this issue. The core.txt and vmcore.o files from when the crash happened are still avaliable though if those will help (see the Google Drive link in my original comment here for the link).
Comment 6 Mark Johnston freebsd_committer freebsd_triage 2020-06-16 18:53:22 UTC
(In reply to Adam Jimerson from comment #5)
Indeed, we need a copy of the kernel in order to open the vmcore.  It doesn't seem to be straightforward to get a copy of the 12.1p2 kernel once additional patches have been released, which is itself a problem but also means that we can't easily debug the crash.

I will close the bug for now since the crash doesn't appear reproducible.  Please re-open if it happens again.
Comment 7 Adam Jimerson 2020-06-27 11:30:53 UTC
Leaving this closed for now as for some reason the crash dumps either did not get exported/saved so don't know how useful this is but I did see that this happened again yesterday and all I have to go off of is the stacktrace from /var/log/messages:

Jun 26 21:38:42 Heimdall syslogd: kernel boot file is /boot/kernel/kernel
Jun 26 21:38:42 Heimdall kernel: spin lock 0xffffffff8200cc50 (smp rendezvous) held by 0xfffff80214eb55e0 (tid 101663) too long
Jun 26 21:38:42 Heimdall kernel: timeout stopping cpus
Jun 26 21:38:42 Heimdall kernel: panic: spin lock held too long
Jun 26 21:38:42 Heimdall kernel: cpuid = 0
Jun 26 21:38:42 Heimdall kernel: time = 1593001188
Jun 26 21:38:42 Heimdall kernel: KDB: stack backtrace:
Jun 26 21:38:42 Heimdall kernel: #0 0xffffffff80c1d307 at kdb_backtrace+0x67
Jun 26 21:38:42 Heimdall kernel: #1 0xffffffff80bd063d at vpanic+0x19d
Jun 26 21:38:42 Heimdall kernel: #2 0xffffffff80bd0493 at panic+0x43
Jun 26 21:38:42 Heimdall kernel: #3 0xffffffff80bb09fd at _mtx_lock_indefinite_check+0x6d
Jun 26 21:38:42 Heimdall kernel: #4 0xffffffff80bb0545 at _mtx_lock_spin_cookie+0xd5
Jun 26 21:38:42 Heimdall kernel: #5 0xffffffff8121fb0e at smp_targeted_tlb_shootdown+0x3ce
Jun 26 21:38:42 Heimdall kernel: #6 0xffffffff8109c2af at pmap_advise+0x57f
Jun 26 21:38:42 Heimdall kernel: #7 0xffffffff80f16214 at vm_map_madvise+0x354
Jun 26 21:38:42 Heimdall kernel: #8 0xffffffff80f1c259 at sys_madvise+0x79
Jun 26 21:38:42 Heimdall kernel: #9 0xffffffff810a8984 at amd64_syscall+0x364
Jun 26 21:38:42 Heimdall kernel: #10 0xffffffff81082520 at fast_syscall_common+0x101
Jun 23 21:42:25 Heimdall pulseaudio[46631]: [(null)] oss-util.c: '/dev/dsp2' doesn't support full duplex
Jun 23 21:42:25 Heimdall pulseaudio[46631]: [(null)] oss-util.c: '/dev/dsp3' doesn't support full duplex
Jun 23 21:42:25 Heimdall pulseaudio[46631]: [(null)] oss-util.c: '/dev/dsp4' doesn't support full duplex
Jun 23 21:42:25 Heimdall pulseaudio[46631]: [(null)] oss-util.c: '/dev/dsp5' doesn't support full duplex
Jun 23 21:42:25 Heimdall pulseaudio[46631]: [(null)] oss-util.c: '/dev/dsp8' doesn't support full duplex
Jun 26 21:38:42 Heimdall syslogd: kernel boot file is /boot/kernel/kernel
Jun 26 21:38:42 Heimdall kernel: spin lock 0xffffffff8200cc50 (smp rendezvous) held by 0xfffff80214eb55e0 (tid 101663) too long
Jun 26 21:38:42 Heimdall kernel: timeout stopping cpus
Jun 26 21:38:42 Heimdall kernel: panic: spin lock held too long
Jun 26 21:38:42 Heimdall kernel: cpuid = 0
Jun 26 21:38:42 Heimdall kernel: time = 1593001188
Jun 26 21:38:42 Heimdall kernel: KDB: stack backtrace:
Jun 26 21:38:42 Heimdall kernel: #0 0xffffffff80c1d307 at kdb_backtrace+0x67
Jun 26 21:38:42 Heimdall kernel: #1 0xffffffff80bd063d at vpanic+0x19d
Jun 26 21:38:42 Heimdall kernel: #2 0xffffffff80bd0493 at panic+0x43
Jun 26 21:38:42 Heimdall kernel: #3 0xffffffff80bb09fd at _mtx_lock_indefinite_check+0x6d
Jun 26 21:38:42 Heimdall kernel: #4 0xffffffff80bb0545 at _mtx_lock_spin_cookie+0xd5
Jun 26 21:38:42 Heimdall kernel: #5 0xffffffff8121fb0e at smp_targeted_tlb_shootdown+0x3ce
Jun 26 21:38:42 Heimdall kernel: #6 0xffffffff8109c2af at pmap_advise+0x57f
Jun 26 21:38:42 Heimdall kernel: #7 0xffffffff80f16214 at vm_map_madvise+0x354
Jun 26 21:38:42 Heimdall kernel: #8 0xffffffff80f1c259 at sys_madvise+0x79
Jun 26 21:38:42 Heimdall kernel: #9 0xffffffff810a8984 at amd64_syscall+0x364
Jun 26 21:38:42 Heimdall kernel: #10 0xffffffff81082520 at fast_syscall_common+0x101

I can upload the current kernel I'm using (FreeBSD 12.1-p6 GENERIC) if need be, but if it happens again and I get crash dumps for it I'll reopen.
Comment 8 Adam Jimerson 2020-06-29 10:34:09 UTC
Okay so this morning I had another spin lock, and it was able to create the crash dump. The stacktrace info from /var/log/messages, the crash dumps from /var/crash and my current kernel are all available here https://drive.google.com/file/d/1LtXPg3XarwWxAwTZuzD3ug6oY_QGJUT9/view?usp=sharing
Comment 9 Andriy Gapon freebsd_committer freebsd_triage 2020-06-30 08:39:51 UTC
What is your hardware?
Comment 10 Adam Jimerson 2020-06-30 11:09:15 UTC
(In reply to Andriy Gapon from comment #9)

My hardware info is available at https://bsd-hardware.info/?probe=b955e8c3bc that seems to list all my hardware including attached USB devices.
Comment 11 Adam Jimerson 2020-12-16 06:28:39 UTC
I would like to add that this is still happening periodically with 12.2-RELEASE-p2 GENERIC although the latest core dump did not get saved as it was too large for the swap space I have available on my system.
Comment 12 Mark Johnston freebsd_committer freebsd_triage 2020-12-28 14:46:34 UTC
(In reply to Adam Jimerson from comment #11)
Could you try enabling compression on the core dump?  Add -Z to the dumpon(8) invocation, e.g., by adding dumpon_flags="-Z" to /etc/rc.conf.
Comment 13 Adam Jimerson 2020-12-28 16:30:08 UTC
(In reply to Mark Johnston from comment #12)

I certainly can enable that and see if that helps.
Comment 14 Adam Jimerson 2021-01-03 12:52:19 UTC
Okay so this happened again last night, I have the contents of /var/crash /boot and the dmesg output all avalable here https://drive.google.com/file/d/1524VJIumcrcGKQNPXmtP5_4g4B0MjHf9/view?usp=sharing
Comment 15 Mark Johnston freebsd_committer freebsd_triage 2021-01-03 16:28:25 UTC
(In reply to Adam Jimerson from comment #14)
Thanks.  Unfortunately I need a copy of /usr/lib/debug/boot/kernel as well.  Could you provide a copy?
Comment 16 Adam Jimerson 2021-01-04 02:31:33 UTC
(In reply to Mark Johnston from comment #15)

Certanly, here is the contents of that directory https://drive.google.com/file/d/1WyqHAXzOsF9-EELo9f9ovr5Uc1KvQ3GV/view?usp=sharing