250667 – Frequent Kernel panic for spin lock on TrueNAS (FreeBSD 12.2)

Bug 250667 - Frequent Kernel panic for spin lock on TrueNAS (FreeBSD 12.2)

Summary: Frequent Kernel panic for spin lock on TrueNAS (FreeBSD 12.2)

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern (show other bugs)
Version:	CURRENT
Hardware:	amd64 Any

Importance:	--- Affects Some People
Assignee:	Mark Johnston

URL:
Keywords:

Depends on:
Blocks:

Reported:	2020-10-27 09:56 UTC by Seb
Modified:	2020-11-26 19:52 UTC (History)
CC List:	4 users (show)

See Also:

Attachments
Crashdump (26th October) (117.11 KB, application/x-zip-compressed) 2020-10-27 09:56 UTC, Seb	no flags	Details
Text Dump 2 (126.79 KB, application/x-gzip) 2020-11-02 21:40 UTC, Seb	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Seb 2020-10-27 09:56:53 UTC

Created attachment 219141 [details]
Crashdump (26th October)

Hi all,

I am having a nightmare since migrating a NAS system to FreeBSD 12.2 based TrueNAS 12.  

The system, which had been running without issue on FreeBSD 11 for over a year, will now not stay up for more than a few hours at a time.  In every instance the cause of the crash is 'spin lock held too long'. 

Annecodtally it seems to coincide with increased IO on BHYVE guest VMs, but this could be a red herring. 

Crash dump attached.

Comment 1 Mark Johnston freebsd_committer

2020-10-27 14:02:51 UTC

We're panicking because an idle thread is processing callouts and apparently taking too long.  Unfortunately I can't see what it was actually doing.  There aren't many users of C_DIRECT_EXEC but it's possible the thread is somehow stuck in a callout handler.  It's possible that some memory corruption is causing callout_process() to loop infinitely on a callout list.

I don't think it'll be easy to debug this without a vmcore: https://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html#kerneldebug-obtain

Comment 2 Seb 2020-10-27 16:13:51 UTC

Thanks Mark.
I've just completed a full MemTest86 on the system to rule out memory corruption, the server is fitted with ECC memory in any case. 

I have set the system to output full minidump core dumps.  Just waiting for the next crash now...

Comment 3 Seb 2020-11-02 21:13:34 UTC

Turns out TrueNAS does not output crash-dumps in the way I was expecting, I've has another 9 kernel panics since my last comment but all produced only text-dumps.  I *think* I've got it setup to do a core-dump next crash now using sysctl to override the text-dump pending flag... so hopefully will have a vmcore for you soon. 

The panic string has been different the last few times:

Dump header from device: /dev/ada1p3
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 1751040
  Blocksize: 512
  Compression: none
  Dumptime: Sun Nov  1 08:33:26 2020
  Hostname: nas.lan
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 12.2-RC3 7c4ec6ff02c(HEAD) TRUENAS
  Panic String: Bad tailq NEXT(0xffffffff82134d98->tqh_last) != NULL
  Dump Parity: 3497888606
  Bounds: 3
  Dump Status: good

Another TrueNAS user is reporting similar issues and believes it may be related to the Intel 10Gb 'ix' drivers. Is that a possibility?

Comment 4 Seb 2020-11-02 21:37:48 UTC

The message buffer seems to indicate the panic occurs in relation to interface ix1 coming 'up':

<6>ix1: link state changed to UP
panic: Bad tailq NEXT(0xffffffff82135158->tqh_last) != NULL
cpuid = 3
time = 1604305215
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0075d697a0
vpanic() at vpanic+0x17b/frame 0xfffffe0075d697f0
panic() at panic+0x43/frame 0xfffffe0075d69850
callout_process() at callout_process+0x32f/frame 0xfffffe0075d698c0
handleevents() at handleevents+0x185/frame 0xfffffe0075d69900
timercb() at timercb+0x196/frame 0xfffffe0075d69950
lapic_handle_timer() at lapic_handle_timer+0x9b/frame 0xfffffe0075d69980
Xtimerint() at Xtimerint+0xb1/frame 0xfffffe0075d69980
--- interrupt, rip = 0xffffffff811af6e6, rsp = 0xfffffe0075d69a50, rbp = 0xfffffe0075d69a50 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6/frame 0xfffffe0075d69a50
acpi_cpu_idle() at acpi_cpu_idle+0x232/frame 0xfffffe0075d69aa0
cpu_idle_acpi() at cpu_idle_acpi+0x3e/frame 0xfffffe0075d69ac0
cpu_idle() at cpu_idle+0x9f/frame 0xfffffe0075d69ae0
sched_idletd() at sched_idletd+0x3f1/frame 0xfffffe0075d69bb0
fork_exit() at fork_exit+0x80/frame 0xfffffe0075d69bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0075d69bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic

Comment 5 Seb 2020-11-02 21:40:00 UTC

Created attachment 219312 [details]
Text Dump 2

Comment 6 Mark Johnston freebsd_committer

2020-11-02 22:56:41 UTC

(In reply to Seb from comment #4)
That may just be the last line printed in a while.  I would expect the interfaces to come up during boot, at which point they'll log messages.

Indeed, from the second panic it looks like the callout lists are getting corrupted somehow.

Comment 7 Mark Johnston freebsd_committer

2020-11-18 14:49:11 UTC

By coincidence, syzkaller found a bug which manifests with exactly the same panic.  I posted a patch for it here: https://reviews.freebsd.org/D27266

I'm not sure if this is what you're hitting, but it seems plausible if you have a lot of CPU load on the system.

Comment 8 Alexander Motin freebsd_committer

2020-11-18 18:45:07 UTC

Seb, I've built TrueNAS 12.0-RELEASE kernel with the Mark's patch: https://people.freebsd.org/~mav/kernel.tgz

You may create boot environment just in case, unpack the archive into /boot and reboot.  Let us know whether it help.

Comment 9 Seb 2020-11-19 09:48:37 UTC

Hi folks,

Truly appreciate all the help.  I'm afraid how to get this Kernel running on my current TrueNAS system is a little beyong my current knowledge.

Can you point me towards some more details steps?

Comment 10 Alexander Motin freebsd_committer

2020-11-19 14:15:05 UTC

1) In WebUI go to System -> Boot to create new boot environment in case something go wrong.
2) Copy the kernel.tgz archive somewhere on the NAS.
3) Get to command prompt via SSH or Shell menu on WebUI.
4) `cd /boot`
5) `mv kernel kernel.prev`
6) `tar -xzvf /path/to/the/uploaded/kernel.tgz`
7) Reboot the NAS and see what happen.  If something go wrong, select the created boot environment in the boot menu.

Comment 11 Seb 2020-11-19 16:06:34 UTC

Thank you!

OK.  It booted and it's up using the patched Kernel.  

Will need to give it a bit of time now.  The longest I've seen between panics is about 48 hours, so I'll report back in a few days.

Comment 12 Seb 2020-11-22 13:45:16 UTC

Coming up on 36 hours without a panic now.  I will give it another few days to be sure but I'm confident this one is fixed.  Thanks to all. 

I have no idea why my particular instance was so badly affectedby this bug, it is not a high-load system.  The only thing 'unusual' about it is that it acts as a network video recorder for my camera system so it has a reasonable level of constant network and disk IO.

Comment 13 Seb 2020-11-22 13:46:01 UTC

Sorry - meant to say 72 hours...

Comment 14 Alexander Motin freebsd_committer

2020-11-22 14:01:09 UTC

Thank you for the update.  The patch is committed to FreeBSD head and I've merged it into upcoming TrueNAS 12.0-U1.  Let us know how will it go further.

Comment 15 Mark Johnston freebsd_committer

2020-11-26 19:52:31 UTC

I merged the change to stable/12.  I think we'll issue an EN for this as well.  Please re-open if you have a reason to suspect that the problem still exists.