Bug 247208 - mpt(4): VMWare virtualized LSI controller panics during hot-attach
Summary: mpt(4): VMWare virtualized LSI controller panics during hot-attach
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.3-RELEASE
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: freebsd-virtualization (Nobody)
Keywords: crash, needs-qa
 
Reported: 2020-06-12 14:34 UTC by Allan Jude
Modified: 2020-09-01 16:07 UTC
CC List: 3 users

Flags:
koobs: mfc-stable12?
koobs: mfc-stable11?


Attachments
proposed patch (1.53 KB, patch)
2020-06-12 14:58 UTC, Mark Johnston

Description Allan Jude freebsd_committer 2020-06-12 14:34:36 UTC
Hot-attaching an LSI controller to a VMware guest causes a panic on 2 out of 2 machines I tested:


Jun 12 09:47:52 nfs-server-00 pcib11: Attention Button Pressed: Detaching in 5 seconds
Jun 12 09:47:52 nfs-server-00 pci4: <ACPI PCI bus> on pcib11
Jun 12 09:47:52 nfs-server-00 mpt1: <LSILogic SAS/SATA Adapter> at device 0.0 on pci4
Jun 12 09:47:52 nfs-server-00 mpt1: MPI Version=1.5.0.0
Jun 12 09:47:52 nfs-server-00 mpt1: cannot allocate 262144 bytes of request memory
Jun 12 09:47:52 nfs-server-00 mpt1: mpt_dma_buf_alloc() failed!
Jun 12 09:47:52 nfs-server-00 mpt1: failed to enable port 0
Jun 12 09:49:26 nfs-server-00 syslog-ng[1045]: syslog-ng starting up; version='3.23.1'
Jun 12 09:49:26 nfs-server-00 Fatal trap 12: page fault while in kernel mode
Jun 12 09:49:26 nfs-server-00 cpuid = 0; apic id = 00
Jun 12 09:49:26 nfs-server-00 fault virtual address = 0x10
Jun 12 09:49:26 nfs-server-00 fault code            = supervisor read data, page not present
Jun 12 09:49:26 nfs-server-00 instruction pointer   = 0x20:0xffffffff80e67174
Jun 12 09:49:26 nfs-server-00 stack pointer         = 0x28:0xfffffe083e1a6940
Jun 12 09:49:26 nfs-server-00 frame pointer         = 0x28:0xfffffe083e1a6940
Jun 12 09:49:26 nfs-server-00 code segment          = base rx0, limit 0xfffff, type 0x1b
Jun 12 09:49:26 nfs-server-00                       = DPL 0, pres 1, long 1, def32 0, gran 1
Jun 12 09:49:26 nfs-server-00 processor eflags      = interrupt enabled, resume, IOPL = 0
Jun 12 09:49:26 nfs-server-00 current process               = 0 (thread taskq)
Jun 12 09:49:26 nfs-server-00 trap number           = 12
Jun 12 09:49:26 nfs-server-00 panic: page fault
Jun 12 09:49:26 nfs-server-00 cpuid = 0
Jun 12 09:49:26 nfs-server-00 KDB: stack backtrace:
Jun 12 09:49:26 nfs-server-00 db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe083e1a65f0
Jun 12 09:49:26 nfs-server-00 vpanic() at vpanic+0x17e/frame 0xfffffe083e1a6650
Jun 12 09:49:26 nfs-server-00 panic() at panic+0x43/frame 0xfffffe083e1a66b0
Jun 12 09:49:26 nfs-server-00 trap_fatal() at trap_fatal+0x369/frame 0xfffffe083e1a6700
Jun 12 09:49:26 nfs-server-00 trap_pfault() at trap_pfault+0x49/frame 0xfffffe083e1a6760
Jun 12 09:49:26 nfs-server-00 trap() at trap+0x29d/frame 0xfffffe083e1a6870
Jun 12 09:49:26 nfs-server-00 calltrap() at calltrap+0x8/frame 0xfffffe083e1a6870
Jun 12 09:49:26 nfs-server-00 --- trap 0xc, rip = 0xffffffff80e67174, rsp = 0xfffffe083e1a6940, rbp = 0xfffffe083e1a6940 ---
Jun 12 09:49:26 nfs-server-00 vm_page_next() at vm_page_next+0x4/frame 0xfffffe083e1a6940
Jun 12 09:49:26 nfs-server-00 kmem_unback() at kmem_unback+0x88/frame 0xfffffe083e1a6980
Jun 12 09:49:26 nfs-server-00 kmem_free() at kmem_free+0x43/frame 0xfffffe083e1a69b0
Jun 12 09:49:26 nfs-server-00 mpt_core_detach() at mpt_core_detach+0xf9/frame 0xfffffe083e1a69d0
Jun 12 09:49:26 nfs-server-00 mpt_detach() at mpt_detach+0xd5/frame 0xfffffe083e1a69f0
Jun 12 09:49:26 nfs-server-00 mpt_pci_detach() at mpt_pci_detach+0x23/frame 0xfffffe083e1a6a10
Jun 12 09:49:26 nfs-server-00 device_detach() at device_detach+0x183/frame 0xfffffe083e1a6a50
Jun 12 09:49:26 nfs-server-00 bus_generic_detach() at bus_generic_detach+0x48/frame 0xfffffe083e1a6a70
Jun 12 09:49:26 nfs-server-00 pci_detach() at pci_detach+0xe/frame 0xfffffe083e1a6a90
Jun 12 09:49:26 nfs-server-00 device_detach() at device_detach+0x183/frame 0xfffffe083e1a6ad0
Jun 12 09:49:26 nfs-server-00 device_delete_child() at device_delete_child+0x15/frame 0xfffffe083e1a6af0
Jun 12 09:49:26 nfs-server-00 pcib_pcie_hotplug_task() at pcib_pcie_hotplug_task+0x87/frame 0xfffffe083e1a6b20
Jun 12 09:49:26 nfs-server-00 taskqueue_run_locked() at taskqueue_run_locked+0x185/frame 0xfffffe083e1a6b80
Jun 12 09:49:26 nfs-server-00 taskqueue_thread_loop() at taskqueue_thread_loop+0xb8/frame 0xfffffe083e1a6bb0
Jun 12 09:49:26 nfs-server-00 fork_exit() at fork_exit+0x83/frame 0xfffffe083e1a6bf0
Comment 1 Mark Johnston freebsd_committer 2020-06-12 14:50:12 UTC
Looks like mpt_dma_buf_free() doesn't handle this particular failure mode properly.
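The failure mode can be illustrated with a minimal userspace sketch. It does not reproduce the contents of attachment 215490, which is not shown here: the struct, the `demo_*` functions, and the two-stage "tag, then request memory" layout are hypothetical stand-ins for the driver's DMA resources, with `malloc(3)`/`free(3)` in place of the kernel allocators. The idea is that the free routine releases only the stages that were actually allocated and clears each pointer afterwards, so calling it from detach after a failed attach is safe, and calling it twice is a no-op.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical softc owning two staged resources. These are NOT the real
 * mpt(4) fields; they only model "stage 1 succeeded, stage 2 failed".
 */
struct demo_softc {
	void	*tag;		/* stage 1: stand-in for a DMA tag */
	void	*request;	/* stage 2: stand-in for request memory */
	size_t	 request_size;
};

/* Counts release calls so the behaviour can be checked. */
static int demo_releases;

/*
 * Stand-in release routine that must never be handed an unallocated
 * buffer (an assumption modelling the crash in the backtrace above).
 */
static void
demo_release(void *p)
{
	assert(p != NULL);
	demo_releases++;
	free(p);
}

/* Allocate both stages; fail_request simulates the request-memory failure. */
static int
demo_alloc(struct demo_softc *sc, size_t size, bool fail_request)
{
	sc->tag = malloc(1);
	if (sc->tag == NULL)
		return (1);
	sc->request = fail_request ? NULL : malloc(size);
	if (sc->request == NULL)
		return (1);	/* sc->tag is left for the detach path */
	sc->request_size = size;
	return (0);
}

/*
 * Release only what was actually allocated, and clear each pointer so a
 * repeated call (e.g. detach after a failed attach) does nothing.
 */
static void
demo_free(struct demo_softc *sc)
{
	if (sc->request != NULL) {
		demo_release(sc->request);
		sc->request = NULL;
		sc->request_size = 0;
	}
	if (sc->tag != NULL) {
		demo_release(sc->tag);
		sc->tag = NULL;
	}
}
```

In the simulated failure case, `demo_free()` releases only the tag; a second call finds both pointers NULL and returns without touching anything.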
Comment 2 Mark Johnston freebsd_committer 2020-06-12 14:58:52 UTC
Created attachment 215490 [details]
proposed patch

Could you try this patch?  It'll still be necessary to figure out why the memory allocation is failing, but hopefully it won't panic anymore.

mpt_dma_mem_free() has the same problem, which remains unfixed for now.
Comment 3 Kubilay Kocak freebsd_committer freebsd_triage 2020-08-10 03:19:42 UTC
@Allan, have you had a chance to test this?
Comment 4 Josh Paetzel freebsd_committer 2020-08-29 20:43:37 UTC
For a baseline I tried:


root@freebsd13:/home/jpaetzel # uname -a
FreeBSD freebsd13.demosp.com 13.0-CURRENT FreeBSD 13.0-CURRENT #0 r364438: Fri Aug 21 09:29:22 PDT 2020     jpaetzel@freebsd13.demosp.com:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64

I hot-added an MPT controller to the system and didn't get a panic.

What version of FreeBSD were you using that got the panic?
Comment 5 Josh Paetzel freebsd_committer 2020-09-01 15:08:48 UTC
OK, so I see that the reason I couldn't reproduce the panic is that this patch has already been committed to HEAD.

So I think that's verification that the patch works.
Comment 6 Mark Johnston freebsd_committer 2020-09-01 15:39:50 UTC
(In reply to Josh Paetzel from comment #5)
It was not committed as far as I know.  The patch fixes bugs in the controller detach path.  That is, the driver fails to attach for some reason (a failure you don't seem to hit), and then panics when cleaning up.
Comment 7 Josh Paetzel freebsd_committer 2020-09-01 16:07:09 UTC
"I see," said the blind man as he pissed into the wind. It's all coming back to me now.

I had built the kernel with the patch but tried to reproduce on the original kernel.  I couldn't reproduce the issue; then I looked at your patch and at my sources (which already had the patch applied) and came to the conclusion that I couldn't reproduce the issue because I was testing the fix.

To be clear, I was testing stock FreeBSD HEAD WITHOUT the patch and could not reproduce the problem.