Created attachment 219141 [details]
Crashdump (26th October)
I am having a nightmare since migrating a NAS system to FreeBSD 12.2 based TrueNAS 12.
The system, which had been running without issue on FreeBSD 11 for over a year, will now not stay up for more than a few hours at a time. In every instance the cause of the crash is 'spin lock held too long'.
Annecodtally it seems to coincide with increased IO on BHYVE guest VMs, but this could be a red herring.
Crash dump attached.
We're panicking because an idle thread is processing callouts and apparently taking too long. Unfortunately I can't see what it was actually doing. There aren't many users of C_DIRECT_EXEC but it's possible the thread is somehow stuck in a callout handler. It's possible that some memory corruption is causing callout_process() to loop infinitely on a callout list.
I don't think it'll be easy to debug this without a vmcore: https://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html#kerneldebug-obtain
I've just completed a full MemTest86 on the system to rule out memory corruption, the server is fitted with ECC memory in any case.
I have set the system to output full minidump core dumps. Just waiting for the next crash now...
Turns out TrueNAS does not output crash-dumps in the way I was expecting, I've has another 9 kernel panics since my last comment but all produced only text-dumps. I *think* I've got it setup to do a core-dump next crash now using sysctl to override the text-dump pending flag... so hopefully will have a vmcore for you soon.
The panic string has been different the last few times:
Dump header from device: /dev/ada1p3
Architecture Version: 4
Dump Length: 1751040
Dumptime: Sun Nov 1 08:33:26 2020
Magic: FreeBSD Text Dump
Version String: FreeBSD 12.2-RC3 7c4ec6ff02c(HEAD) TRUENAS
Panic String: Bad tailq NEXT(0xffffffff82134d98->tqh_last) != NULL
Dump Parity: 3497888606
Dump Status: good
Another TrueNAS user is reporting similar issues and believes it may be related to the Intel 10Gb 'ix' drivers. Is that a possibility?
The message buffer seems to indicate the panic occurs in relation to interface ix1 coming 'up':
<6>ix1: link state changed to UP
panic: Bad tailq NEXT(0xffffffff82135158->tqh_last) != NULL
cpuid = 3
time = 1604305215
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0075d697a0
vpanic() at vpanic+0x17b/frame 0xfffffe0075d697f0
panic() at panic+0x43/frame 0xfffffe0075d69850
callout_process() at callout_process+0x32f/frame 0xfffffe0075d698c0
handleevents() at handleevents+0x185/frame 0xfffffe0075d69900
timercb() at timercb+0x196/frame 0xfffffe0075d69950
lapic_handle_timer() at lapic_handle_timer+0x9b/frame 0xfffffe0075d69980
Xtimerint() at Xtimerint+0xb1/frame 0xfffffe0075d69980
--- interrupt, rip = 0xffffffff811af6e6, rsp = 0xfffffe0075d69a50, rbp = 0xfffffe0075d69a50 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6/frame 0xfffffe0075d69a50
acpi_cpu_idle() at acpi_cpu_idle+0x232/frame 0xfffffe0075d69aa0
cpu_idle_acpi() at cpu_idle_acpi+0x3e/frame 0xfffffe0075d69ac0
cpu_idle() at cpu_idle+0x9f/frame 0xfffffe0075d69ae0
sched_idletd() at sched_idletd+0x3f1/frame 0xfffffe0075d69bb0
fork_exit() at fork_exit+0x80/frame 0xfffffe0075d69bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0075d69bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
Created attachment 219312 [details]
Text Dump 2
(In reply to Seb from comment #4)
That may just be the last line printed in a while. I would expect the interfaces to come up during boot, at which point they'll log messages.
Indeed, from the second panic it looks like the callout lists are getting corrupted somehow.
By coincidence, syzkaller found a bug which manifests with exactly the same panic. I posted a patch for it here: https://reviews.freebsd.org/D27266
I'm not sure if this is what you're hitting, but it seems plausible if you have a lot of CPU load on the system.
Seb, I've built TrueNAS 12.0-RELEASE kernel with the Mark's patch: https://people.freebsd.org/~mav/kernel.tgz
You may create boot environment just in case, unpack the archive into /boot and reboot. Let us know whether it help.
Truly appreciate all the help. I'm afraid how to get this Kernel running on my current TrueNAS system is a little beyong my current knowledge.
Can you point me towards some more details steps?
1) In WebUI go to System -> Boot to create new boot environment in case something go wrong.
2) Copy the kernel.tgz archive somewhere on the NAS.
3) Get to command prompt via SSH or Shell menu on WebUI.
4) `cd /boot`
5) `mv kernel kernel.prev`
6) `tar -xzvf /path/to/the/uploaded/kernel.tgz`
7) Reboot the NAS and see what happen. If something go wrong, select the created boot environment in the boot menu.
OK. It booted and it's up using the patched Kernel.
Will need to give it a bit of time now. The longest I've seen between panics is about 48 hours, so I'll report back in a few days.
Coming up on 36 hours without a panic now. I will give it another few days to be sure but I'm confident this one is fixed. Thanks to all.
I have no idea why my particular instance was so badly affectedby this bug, it is not a high-load system. The only thing 'unusual' about it is that it acts as a network video recorder for my camera system so it has a reasonable level of constant network and disk IO.
Sorry - meant to say 72 hours...
Thank you for the update. The patch is committed to FreeBSD head and I've merged it into upcoming TrueNAS 12.0-U1. Let us know how will it go further.
I merged the change to stable/12. I think we'll issue an EN for this as well. Please re-open if you have a reason to suspect that the problem still exists.