Bug 243867 - CAM NULL pointer dereference on Dell R540: pass_add_physpath -> xpt_getattr -> scsi_action
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.3-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-scsi mailing list
URL:
Keywords: panic
Depends on:
Blocks:
 
Reported: 2020-02-04 11:32 UTC by Oleg Cherkasov
Modified: 2020-02-06 09:43 UTC
0 users

See Also:


Attachments
stack trace dump from the iDRAC virtual console (315.51 KB, image/png)
2020-02-04 11:32 UTC, Oleg Cherkasov
new panic around day after (25.48 KB, image/png)
2020-02-06 09:43 UTC, Oleg Cherkasov

Description Oleg Cherkasov 2020-02-04 11:32:30 UTC
Created attachment 211341
stack trace dump from the iDRAC virtual console

Hi,

One of our Dell R540 servers has had a periodic issue since it was put into production a few months ago.  It reboots roughly every 25 days with no minidumps or messages in the logs.  We suspected the motherboard, and Dell replaced it before Xmas.  That kept the server running for more than 40 days, but then it happened again, and then again less than 2 days later.

Yesterday the server stopped responding, and a quick glance at the iDRAC virtual console revealed the panic screen; see the attached screenshot.  Unfortunately I could not scroll the screen up because of the iDRAC virtual console.

The swap is 24 GB.  Dumping stalled for 10-15, so I had to do a cold reboot because nothing was happening.

Any ideas whether it is a hardware or software issue?

The system was recently upgraded to 11.3-RELEASE-p6.  It is a Dell R540 with 128 GB RAM, a Dell BOSS NVMe RAID0, an H730P RAID controller with 12 JBOD disks, plus an HBA-connected MD1400 with 12 disks via multipath.  The ZFS pool has 2 vdevs; the system is on a UFS partition on the M.2/BOSS flash disk.

/boot/device.hints:

hw.mfi.mrsas_enable="1"


/boot/loader.conf.local:

vm.kmem_size_max=130000000000
vm.kmem_size=130000000000
vfs.zfs.arc_max=128000000000
vm.pmap.pti=0
hw.ibrs_disable=1
hw.spec_store_bypass_disable=1
geom_multipath_load="YES"
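
For reference, the byte values above can be converted to GiB and compared with installed memory.  This is only a rough sketch: it assumes that "128Gb RAM" means 128 GiB, and the helper names (to_gib, headroom_gib) are made up for illustration.  It says nothing about the cause of the panic; it just makes the sizes easier to read.

```python
# Values copied from /boot/loader.conf.local above (all in bytes).
TUNABLES = {
    "vm.kmem_size_max": 130_000_000_000,
    "vm.kmem_size": 130_000_000_000,
    "vfs.zfs.arc_max": 128_000_000_000,
}

GIB = 2**30

# Assumption: the reported "128Gb RAM" is 128 GiB of installed memory.
ASSUMED_RAM_BYTES = 128 * GIB


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB


def headroom_gib(ram_bytes, limit_bytes):
    """GiB of RAM left over if the given limit were fully used."""
    return to_gib(ram_bytes - limit_bytes)


if __name__ == "__main__":
    for name, value in TUNABLES.items():
        print(
            f"{name}: {to_gib(value):.2f} GiB, "
            f"headroom {headroom_gib(ASSUMED_RAM_BYTES, value):.2f} GiB"
        )
```

Under that assumption, vfs.zfs.arc_max works out to about 119.2 GiB, which leaves under 9 GiB of RAM for everything else.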


The system is 100% NAS with Samba 4.10, so no jails, VMs, or active users.

I would appreciate any ideas on how to debug or diagnose the issue.
Comment 1 Oleg Cherkasov 2020-02-04 20:02:02 UTC
Got a panic again ... switched back to 11.3-RELEASE-p5
Comment 2 Oleg Cherkasov 2020-02-06 09:43:41 UTC
Created attachment 211408
new panic around day after

11.3-RELEASE-p5

The stack trace is slightly different.

I am still not sure why it does not dump to swap.  The swap is 24 GB, so it should be enough for minidumps.  Does this mean the kernel lost the ada0 device, with all its partitions, on panic?