Created attachment 211341
stack trace dump from the iDRAC virtual console
One of our Dell R540 servers has had a recurring issue since it was put into production a few months ago. It reboots roughly every 25 days with no minidumps or messages in the logs. The motherboard was the suspect and Dell replaced it before Christmas. That kept the server running for more than 40 days, but then it happened again, and then again less than 2 days later.
Yesterday the server stopped responding, and a quick glance at the iDRAC virtual console revealed the panic screen (see the attached screenshot). Unfortunately, I could not scroll the screen up in the iDRAC virtual console.
The swap is 24 GB, but the dump stalled for 10-15, so I had to do a cold reboot because nothing was happening.
Any ideas whether it is a hardware or a software issue?
The system was recently upgraded to 11.3-RELEASE-p6. It is a Dell R540 with 128 GB RAM, a Dell BOSS NVMe RAID0, and an H730P RAID controller with 12 JBOD disks, plus an HBA connected to an MD1400 with 12 disks via multipath. The ZFS pool has 2 vdevs; the system is on a UFS partition on the M.2/BOSS flash disk.
The system is 100% NAS with Samba 4.10, so there are no jails, VMs, or active users.
I would appreciate any ideas on how to debug or diagnose the issue.
Got a panic again ... switched back to 11.3-RELEASE-p5
Created attachment 211408
New panic about a day later
The stack trace is slightly different.
I am still not sure why it does not dump to swap. The swap is 24 GB, which is enough for minidumps. Does this mean the kernel lost the ada0 device, with all its partitions, on panic?
The issue turned out to be a hardware problem with the Dell 12Gbps HBA adapter:
mpr0@pci0:1:0:0: class=0x010700 card=0x1f461028 chip=0x00971000 rev=0x02 hdr=0x00
    vendor     = 'LSI Logic / Symbios Logic'
    device     = 'SAS3008 PCI-Express Fusion-MPT SAS-3'
    class      = mass storage
    subclass   = SAS
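For anyone comparing adapters: the card= and chip= values in the pciconf line above each pack two 16-bit PCI IDs, with the vendor ID in the low half and the device ID in the high half. A minimal Python sketch for decoding them follows; the function names decode_pci_id and parse_pciconf_line are made up for illustration and are not part of any FreeBSD tool.

```python
import re


def decode_pci_id(value):
    """Split a 32-bit pciconf ID into (vendor, device).

    The low 16 bits hold the vendor ID, the high 16 bits the device ID.
    """
    return value & 0xFFFF, value >> 16


def parse_pciconf_line(line):
    """Collect the key=0x... fields of one `pciconf -l` line as integers."""
    return {k: int(v, 16) for k, v in re.findall(r"(\w+)=(0x[0-9a-fA-F]+)", line)}


# The device line quoted in this report.
line = ("mpr0@pci0:1:0:0: class=0x010700 card=0x1f461028 "
        "chip=0x00971000 rev=0x02 hdr=0x00")

fields = parse_pciconf_line(line)
vendor, device = decode_pci_id(fields["chip"])
subvendor, subdevice = decode_pci_id(fields["card"])

print(f"chip: vendor=0x{vendor:04x} device=0x{device:04x}")
print(f"card: subvendor=0x{subvendor:04x} subdevice=0x{subdevice:04x}")
```

Here the chip vendor 0x1000 is the LSI (now Broadcom) PCI vendor ID and the subvendor 0x1028 is Dell, which matches a Dell-branded SAS3008 HBA.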
The adapter was swapped for a Dell H830 adapter from a spare server, and the system has worked with no issues since then. The uptime is 93 days, the longest period ever.
I installed the HBA in an aging Dell R330 server and after some time started to receive numerous diagnostic reports about a faulty power supply module. Rebooting, updating firmware, and running diagnostics did not solve the issue, and Dell agreed to replace the motherboard along with one of the power supplies and the HBA. No issues have been detected since then.
My conclusion: the original Dell 12Gbps HBA was faulty and caused problems in both servers.
The report may be closed because it is a hardware issue.