Bug 254215 - TrueNAS core kernel panic jails with 10G high network load
Summary: TrueNAS core kernel panic jails with 10G high network load
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: Unspecified
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2021-03-11 10:36 UTC by Rudolf Schuba
Modified: 2022-10-12 01:54 UTC

See Also:


Attachments
kernel panic (377.30 KB, image/jpeg)
2021-03-11 10:36 UTC, Rudolf Schuba

Description Rudolf Schuba 2021-03-11 10:36:14 UTC
Created attachment 223180
kernel panic

TrueNAS core kernel panic due to jails with high network load

My TrueNAS Core system gets a kernel panic when I start backups from Proxmox over the 10G link. After analysis I found that it doesn't happen when I turn off the two jails; the panic only occurs while the jails are running. The message shown on the console is in the attached screenshot.



The server then restarts.

Here are the technical data of the system:

Supermicro server

TrueNAS-12.0-U2.1
2x Intel (R) Xeon (R) CPU E5-2620 0 @ 2.00GHz
64GB RAM
2x HP H220 HBA
10x WD RED 4TB
1x DUAL 10G Chelsio network card

2 jails: Plex and ZoneMinder
Comment 1 Benoit Le Bourhis 2021-04-01 23:00:28 UTC
I have been seeing the same behavior since I upgraded to TrueNAS 12.0 last October, but I couldn't put my finger on the cause. Every few days, sometimes after only a few hours, my server would stall...

My setup is neat and simple. I suspected every component and swapped the board, CPU, RAM, NIC (SFP+ 10G and Intel 1G), HBA, cables... everything, with a second "sleeping" computer.

Only the HDDs and one Plex jail remained...

Right before jumping off a bridge, I decided to stop the Plex jail for a few days. To my surprise, the server has been rock solid, transferring 650GB over the network many times in 4 days.

Today, I decided to recreate the Plex jail. In the process, I had to start and stop it a few times for configuration purposes and voilà! Bang! Everything hung!

The jail had been running for less than 10 minutes :(

Luckily, I managed to grab a picture of a partial error message:

"Apr 1 18:01:04 corpus kernel: mlx4_en mlx4_core0: Internal error detected
arting device"

Despite the date, this is not an April Fools' joke. It's a victory to have found out that the jail was the problem!!

For the record, this server had been running perfectly since the days of FreeNAS 9.x and was upgraded to FreeNAS 11 over the years, up to 11.3-U5, without a hitch.

I now have two options: reinstall 11.3-U5 and rebuild the entire ZFS pool, or scrap all jails and convert Plex to a virtual machine within TrueNAS. What a nice weekend ahead :)
Comment 2 Benoit Le Bourhis 2022-10-12 01:54:01 UTC
Here is an update on this problem, which I solved a while ago: I swapped the 10G NIC for another one I had (Intel X520-DA2) and the problem is gone. I am pretty sure my previous NIC was going bad or was defective.