Created attachment 219141 [details] Crashdump (26th October) Hi all, I am having a nightmare since migrating a NAS system to FreeBSD 12.2 based TrueNAS 12. The system, which had been running without issue on FreeBSD 11 for over a year, will now not stay up for more than a few hours at a time. In every instance the cause of the crash is 'spin lock held too long'. Annecodtally it seems to coincide with increased IO on BHYVE guest VMs, but this could be a red herring. Crash dump attached.
We're panicking because an idle thread is processing callouts and apparently taking too long. Unfortunately I can't see what it was actually doing. There aren't many users of C_DIRECT_EXEC but it's possible the thread is somehow stuck in a callout handler. It's possible that some memory corruption is causing callout_process() to loop infinitely on a callout list. I don't think it'll be easy to debug this without a vmcore: https://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html#kerneldebug-obtain
Thanks Mark. I've just completed a full MemTest86 on the system to rule out memory corruption, the server is fitted with ECC memory in any case. I have set the system to output full minidump core dumps. Just waiting for the next crash now...
Turns out TrueNAS does not output crash-dumps in the way I was expecting, I've has another 9 kernel panics since my last comment but all produced only text-dumps. I *think* I've got it setup to do a core-dump next crash now using sysctl to override the text-dump pending flag... so hopefully will have a vmcore for you soon. The panic string has been different the last few times: Dump header from device: /dev/ada1p3 Architecture: amd64 Architecture Version: 4 Dump Length: 1751040 Blocksize: 512 Compression: none Dumptime: Sun Nov 1 08:33:26 2020 Hostname: nas.lan Magic: FreeBSD Text Dump Version String: FreeBSD 12.2-RC3 7c4ec6ff02c(HEAD) TRUENAS Panic String: Bad tailq NEXT(0xffffffff82134d98->tqh_last) != NULL Dump Parity: 3497888606 Bounds: 3 Dump Status: good Another TrueNAS user is reporting similar issues and believes it may be related to the Intel 10Gb 'ix' drivers. Is that a possibility?
The message buffer seems to indicate the panic occurs in relation to interface ix1 coming 'up': <6>ix1: link state changed to UP panic: Bad tailq NEXT(0xffffffff82135158->tqh_last) != NULL cpuid = 3 time = 1604305215 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0075d697a0 vpanic() at vpanic+0x17b/frame 0xfffffe0075d697f0 panic() at panic+0x43/frame 0xfffffe0075d69850 callout_process() at callout_process+0x32f/frame 0xfffffe0075d698c0 handleevents() at handleevents+0x185/frame 0xfffffe0075d69900 timercb() at timercb+0x196/frame 0xfffffe0075d69950 lapic_handle_timer() at lapic_handle_timer+0x9b/frame 0xfffffe0075d69980 Xtimerint() at Xtimerint+0xb1/frame 0xfffffe0075d69980 --- interrupt, rip = 0xffffffff811af6e6, rsp = 0xfffffe0075d69a50, rbp = 0xfffffe0075d69a50 --- acpi_cpu_c1() at acpi_cpu_c1+0x6/frame 0xfffffe0075d69a50 acpi_cpu_idle() at acpi_cpu_idle+0x232/frame 0xfffffe0075d69aa0 cpu_idle_acpi() at cpu_idle_acpi+0x3e/frame 0xfffffe0075d69ac0 cpu_idle() at cpu_idle+0x9f/frame 0xfffffe0075d69ae0 sched_idletd() at sched_idletd+0x3f1/frame 0xfffffe0075d69bb0 fork_exit() at fork_exit+0x80/frame 0xfffffe0075d69bf0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0075d69bf0 --- trap 0, rip = 0, rsp = 0, rbp = 0 --- KDB: enter: panic
Created attachment 219312 [details] Text Dump 2
(In reply to Seb from comment #4) That may just be the last line printed in a while. I would expect the interfaces to come up during boot, at which point they'll log messages. Indeed, from the second panic it looks like the callout lists are getting corrupted somehow.
By coincidence, syzkaller found a bug which manifests with exactly the same panic. I posted a patch for it here: https://reviews.freebsd.org/D27266 I'm not sure if this is what you're hitting, but it seems plausible if you have a lot of CPU load on the system.
Seb, I've built TrueNAS 12.0-RELEASE kernel with the Mark's patch: https://people.freebsd.org/~mav/kernel.tgz You may create boot environment just in case, unpack the archive into /boot and reboot. Let us know whether it help.
Hi folks, Truly appreciate all the help. I'm afraid how to get this Kernel running on my current TrueNAS system is a little beyong my current knowledge. Can you point me towards some more details steps?
1) In WebUI go to System -> Boot to create new boot environment in case something go wrong. 2) Copy the kernel.tgz archive somewhere on the NAS. 3) Get to command prompt via SSH or Shell menu on WebUI. 4) `cd /boot` 5) `mv kernel kernel.prev` 6) `tar -xzvf /path/to/the/uploaded/kernel.tgz` 7) Reboot the NAS and see what happen. If something go wrong, select the created boot environment in the boot menu.
Thank you! OK. It booted and it's up using the patched Kernel. Will need to give it a bit of time now. The longest I've seen between panics is about 48 hours, so I'll report back in a few days.
Coming up on 36 hours without a panic now. I will give it another few days to be sure but I'm confident this one is fixed. Thanks to all. I have no idea why my particular instance was so badly affectedby this bug, it is not a high-load system. The only thing 'unusual' about it is that it acts as a network video recorder for my camera system so it has a reasonable level of constant network and disk IO.
Sorry - meant to say 72 hours...
Thank you for the update. The patch is committed to FreeBSD head and I've merged it into upcoming TrueNAS 12.0-U1. Let us know how will it go further.
I merged the change to stable/12. I think we'll issue an EN for this as well. Please re-open if you have a reason to suspect that the problem still exists.