Bug 243867

Summary: CAM NULL pointer dereference on Dell R540: pass_add_physpath -> xpt_getattr -> scsi_action
Product: Base System
Reporter: Oleg Cherkasov <oleg.cherkasov>
Component: kern
Assignee: freebsd-scsi mailing list <scsi>
Status: New ---    
Severity: Affects Only Me
Keywords: panic
Priority: ---    
Version: 11.3-RELEASE   
Hardware: amd64   
OS: Any   
Attachments:
Description                                        Flags
stack trace dump from the iDRAC virtual console    none
new panic around day after                         none

Description Oleg Cherkasov 2020-02-04 11:32:30 UTC
Created attachment 211341 [details]
stack trace dump from the iDRAC virtual console

Hi,

One of our Dell R540 servers has had a periodic issue since it was put into production a few months ago.  It reboots every ~25 days with no minidumps or messages in the logs.  We suspected the motherboard, and Dell replaced it before Xmas.  That kept the server running for more than 40 days, then it happened again, and then again less than 2 days later.

Yesterday the server stopped responding, and a quick glance at the iDRAC virtual console revealed the panic screen; see the attached screenshot.  Unfortunately, I could not scroll the screen up because of the iDRAC virtual console.

The swap is 24 GB, and the dump stalled for 10-15, so I had to cold reboot because nothing was happening.

Any ideas whether it is a hardware or software issue?

The system was recently upgraded to 11.3-RELEASE-p6.  It is a Dell R540 with 128 GB RAM, Dell BOSS NVME RAID0, an H730P RAID controller with 12 JBOD disks, plus an HBA connected to an MD1400 with 12 disks via multipath.  The ZFS pool has 2 vdevs; the system is on a UFS partition on the M.2/BOSS flash disk.

/boot/device.hints:

hw.mfi.mrsas_enable="1"


/boot/loader.conf.local:

vm.kmem_size_max=130000000000
vm.kmem_size=130000000000
vfs.zfs.arc_max=128000000000
vm.pmap.pti=0
hw.ibrs_disable=1
hw.spec_store_bypass_disable=1
geom_multipath_load="YES"
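
For reference, the byte values in the tunables above can be converted to GiB so they are easier to compare against the installed RAM.  This is only a minimal sketch: the values are copied from loader.conf.local above, and the helper name to_gib is made up for illustration.

```python
# Values copied from /boot/loader.conf.local above (all in bytes).
tunables = {
    "vm.kmem_size_max": 130000000000,
    "vm.kmem_size": 130000000000,
    "vfs.zfs.arc_max": 128000000000,
}

GIB = 1024 ** 3  # bytes per GiB


def to_gib(n_bytes):
    """Convert a byte count to GiB (hypothetical helper, for illustration only)."""
    return n_bytes / GIB


for name, value in tunables.items():
    print(f"{name} = {value} bytes = {to_gib(value):.1f} GiB")
```

With these values, vm.kmem_size works out to about 121.1 GiB and vfs.zfs.arc_max to about 119.2 GiB.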


The system is 100% NAS with Samba 4.10, so no jails, VMs, or active users.

I would appreciate any ideas on how to debug or diagnose the issue.
Comment 1 Oleg Cherkasov 2020-02-04 20:02:02 UTC
Got a panic again ... switched back to 11.3-RELEASE-p5
Comment 2 Oleg Cherkasov 2020-02-06 09:43:41 UTC
Created attachment 211408 [details]
new panic around day after

11.3-RELEASE-p5

The stack trace is slightly different.

Still not sure why it does not dump to swap.  The swap is 24 GB, so it is enough for minidumps.  Does it mean the kernel lost the ada0 device with all its partitions on panic?