I've created a 10.3-BETA2 image for Google Compute Engine using swills'
script and am getting a panic on boot when the VM is configured with Local
SSD as NVMe (--local-ssd interface="NVME"). This is a regression from
10.2-RELEASE, which boots successfully with an identical configuration.
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x60
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80e16019
stack pointer = 0x28:0xfffffe01bfff59c0
frame pointer = 0x28:0xfffffe01bfff59e0
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (irq11: virtio_pci0+)
[ thread pid 12 tid 100039 ]
Stopped at nvme_ctrlr_intx_handler+0x39: cmpq $0,0x60(%rdi)
Tracing pid 12 tid 100039 td 0xfffff8000422e000
nvme_ctrlr_intx_handler() at nvme_ctrlr_intx_handler+0x39/frame
intr_event_execute_handlers() at intr_event_execute_handlers+0xab/frame
ithread_loop() at ithread_loop+0x96/frame 0xfffffe01bfff5a70
fork_exit() at fork_exit+0x9a/frame 0xfffffe01bfff5ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe01bfff5ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Later inspection with kgdb shows that ctrlr->ioq is NULL. (Line numbers won't line up exactly, since I was trying suggestions from the freebsd-stable@ thread, but the panic was the same in all cases.)
0xffffffff80e16029 in nvme_ctrlr_intx_handler (arg=0xfffffe0000953000) at /usr/src/sys/dev/nvme/nvme_ctrlr.c:819
819 if (ctrlr->ioq.cpl)
(kgdb) print ((struct nvme_controller *)arg)->ioq
$1 = (struct nvme_qpair *) 0x0
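For readers wondering why the fault virtual address is 0x60 rather than 0: reading a member through a NULL struct pointer touches address 0 plus the member's offset, which matches the `cmpq $0,0x60(%rdi)` instruction in the trace. The sketch below only illustrates that arithmetic. `struct fake_qpair` and its padding are hypothetical stand-ins, not the real layout of `struct nvme_qpair`.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for struct nvme_qpair. The 0x60 bytes of padding
 * are an assumption chosen to match the fault address in the panic; the
 * real structure has actual fields in front of cpl.
 */
struct fake_qpair {
	uint8_t	 fields_before_cpl[0x60];
	void	*cpl;
};

/*
 * Address the CPU would read for base->cpl, computed arithmetically so
 * that nothing is actually dereferenced.
 */
static uintptr_t
cpl_access_address(uintptr_t base)
{
	return (base + offsetof(struct fake_qpair, cpl));
}
```

With a base of 0, as in the kgdb output above, `cpl_access_address(0)` evaluates to 0x60.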
Thank you for the report, Andy.
If possible, could you:
* Confirm whether or not the issue is reproducible on a recent 11.0-CURRENT
* Include (as an attachment) another backtrace after the panic if reproducible
This is a regression due to r293328. This will happen on 11-CURRENT as well.
r293328 changed when the controller's ioq array is allocated, so that when we start getting INTx interrupts for the admin queue, ioq is not yet allocated, which causes this panic.
See attached patch.
Created attachment 167329
Patch for bug 207432
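The attachment is not reproduced in this thread. As a rough illustration of the kind of guard the analysis points to, here is a minimal userland sketch, assuming a shared INTx handler that services both the admin queue and the first I/O queue. All `sketch_*` names and fields are hypothetical simplifications, not the driver's actual code or the contents of the patch.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a queue pair. */
struct sketch_qpair {
	void	*cpl;		/* completion ring; NULL until the queue is set up */
	int	 processed;	/* counts handler visits, for illustration only */
};

/* Hypothetical, simplified stand-in for the controller state. */
struct sketch_controller {
	struct sketch_qpair	 adminq;	/* embedded, so always present */
	struct sketch_qpair	*ioq;		/* allocated later; NULL during early init */
};

static void
sketch_process_completions(struct sketch_qpair *qpair)
{
	qpair->processed++;
}

/*
 * Shared INTx handler: always services the admin queue, and services the
 * I/O queue only once it has been allocated.
 */
static void
sketch_intx_handler(void *arg)
{
	struct sketch_controller *ctrlr = arg;

	sketch_process_completions(&ctrlr->adminq);

	/* Test the pointer before the member: ioq may still be NULL here. */
	if (ctrlr->ioq != NULL && ctrlr->ioq->cpl != NULL)
		sketch_process_completions(ctrlr->ioq);
}
```

Calling `sketch_intx_handler()` on a controller whose `ioq` is still NULL touches only the admin queue. Once `ioq` points at a queue with a non-NULL `cpl`, both queues are serviced.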
It's probably worth noting that avoiding INTx with 'hw.pci.honor_msi_blacklist=0' in /boot/loader.conf allows things to boot and function normally.
Are you able to test the attached patch? I'm pretty sure this fixes your issue but wanted to wait to commit in case you can verify it.
I can confirm that the attached patch does boot successfully in INTx mode. Thanks!
A commit references this bug:
Date: Wed Feb 24 00:01:10 UTC 2016
New revision: 295944
nvme: fix intx handler to not dereference ioq during initialization
This was a regression from r293328, which deferred allocation
of the controller's ioq array until after interrupts are enabled
Reported and tested by: Andy Carrel <email@example.com>
MFC after: 3 days
Sponsored by: Intel