Bug 243867 - CAM NULL pointer dereference on Dell R540: pass_add_physpath -> xpt_getattr -> scsi_action
Status: Closed Unable to Reproduce
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.3-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-scsi (Nobody)
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2020-02-04 11:32 UTC by Oleg Cherkasov
Modified: 2022-10-12 00:49 UTC
0 users

See Also:


Attachments
stack trace dump from the iDRAC virtual console (315.51 KB, image/png)
2020-02-04 11:32 UTC, Oleg Cherkasov
new panic around day after (25.48 KB, image/png)
2020-02-06 09:43 UTC, Oleg Cherkasov

Description Oleg Cherkasov 2020-02-04 11:32:30 UTC
Created attachment 211341
stack trace dump from the iDRAC virtual console

Hi,

One of our Dell R540 servers has had a periodic issue since it was put into production a few months ago.  It reboots every ~25 days with no minidumps or messages in the logs.  The motherboard was suspected and Dell replaced it before Christmas, which kept the server running for more than 40 days; then it happened again, and then again less than 2 days later.

Yesterday the server stopped responding, and a quick glance at the iDRAC virtual console revealed the panic screen; see the attached screenshot.  Unfortunately I could not scroll the screen up in the iDRAC virtual console.

The swap is 24 GB and the dump stalled for 10-15, so I had to cold reboot because nothing was happening.

Any ideas whether it is a hardware or software issue?

The system was recently upgraded to 11.3-RELEASE-p6.  It is a Dell R540 with 128 GB RAM, Dell BOSS NVME RAID0, an H730P RAID controller with 12 JBOD disks, plus an HBA connected to an MD1400 with 12 disks via multipath.  The ZFS pool has 2 vdevs; the system is on a UFS partition on the M.2/BOSS flash disk.

/boot/device.hints:

hw.mfi.mrsas_enable="1"


/boot/loader.conf.local:

vm.kmem_size_max=130000000000
vm.kmem_size=130000000000
vfs.zfs.arc_max=128000000000
vm.pmap.pti=0
hw.ibrs_disable=1
hw.spec_store_bypass_disable=1
geom_multipath_load="YES"
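
For readers comparing the tunables above with the installed memory, the byte values can be converted to GiB with a short sketch like the one below. It only does unit arithmetic on the numbers quoted in loader.conf.local; reading "128Gb RAM" as 128 GiB is an assumption, and the names `to_gib` and `ASSUMED_RAM_BYTES` are made up for illustration.

```python
# Sketch: convert the byte values from /boot/loader.conf.local to GiB.
# ASSUMPTION: "128Gb RAM" is read as 128 GiB (128 * 2**30 bytes).

GIB = 2 ** 30


def to_gib(n_bytes: int) -> float:
    """Return a byte count expressed in GiB."""
    return n_bytes / GIB


# Values copied from the loader.conf.local listing in this report.
tunables = {
    "vm.kmem_size_max": 130_000_000_000,
    "vm.kmem_size": 130_000_000_000,
    "vfs.zfs.arc_max": 128_000_000_000,
}

ASSUMED_RAM_BYTES = 128 * GIB  # hypothetical, see assumption above

for name, value in tunables.items():
    print(f"{name}: {to_gib(value):.2f} GiB")

# Memory left outside the ARC ceiling under the assumed RAM size.
headroom = ASSUMED_RAM_BYTES - tunables["vfs.zfs.arc_max"]
print(f"RAM outside vfs.zfs.arc_max: {to_gib(headroom):.2f} GiB")
```

This is unit conversion only and says nothing about the cause of the panic.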


The system is 100% NAS with Samba 4.10, so no jails, VMs, or active users.

I would appreciate any ideas on how to debug or diagnose the issue.
Comment 1 Oleg Cherkasov 2020-02-04 20:02:02 UTC
Got a panic again ... switched back to 11.3-RELEASE-p5
Comment 2 Oleg Cherkasov 2020-02-06 09:43:41 UTC
Created attachment 211408
new panic around day after

11.3-RELEASE-p5

The stack trace is slightly different.

Still not sure why it does not dump to swap.  The swap is 24 GB, so it is enough for minidumps.  Does it mean the kernel lost the ada0 device with all its partitions on panic?
Comment 3 Oleg Cherkasov 2020-06-04 12:22:23 UTC
The issue was a hardware problem with the Dell 12Gbps HBA adapter:

mpr0@pci0:1:0:0:	class=0x010700 card=0x1f461028 chip=0x00971000 rev=0x02 hdr=0x00
    vendor     = 'LSI Logic / Symbios Logic'
    device     = 'SAS3008 PCI-Express Fusion-MPT SAS-3'
    class      = mass storage
    subclass   = SAS

The adapter was swapped with a Dell H830 adapter from a spare server and it has worked with no issues since then.  The uptime is 93 days, the longest period ever.

I then installed the HBA in an aging Dell R330 server and after some time started to receive numerous diagnostic reports about a faulty power supply module.  Rebooting, updating firmware, and running diagnostics did not solve the issue, and Dell agreed to replace the motherboard along with one of the power supplies and the HBA.  Since then no issues have been detected.

My conclusion: the original Dell 12Gbps HBA was faulty and caused problems in both servers.

The report may be closed because it is a hardware issue.