Bug 207432 - panic: nvme_ctrlr_intx_handler
Summary: panic: nvme_ctrlr_intx_handler
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.3-BETA2
Hardware: amd64 Any
Importance: Normal Affects Some People
Assignee: Jim Harris
URL:
Keywords: crash, needs-qa, regression
Depends on:
Blocks:
 
Reported: 2016-02-23 05:49 UTC by Andy Carrel
Modified: 2016-03-07 02:46 UTC
CC List: 4 users

See Also:
koobs: mfc-stable10?


Attachments
Patch for bug 207432 (423 bytes, patch)
2016-02-23 15:46 UTC, Jim Harris
no flags

Description Andy Carrel 2016-02-23 05:49:50 UTC
I've created a 10.3-BETA2 image for Google Compute Engine using swills'
script and am getting a panic on boot when the VM is configured with Local
SSD as NVMe (--local-ssd interface="NVME"). This is a regression from
10.2-RELEASE which will boot successfully with an identical configuration.

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x60
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80e16019
stack pointer        = 0x28:0xfffffe01bfff59c0
frame pointer        = 0x28:0xfffffe01bfff59e0
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (irq11: virtio_pci0+)
[ thread pid 12 tid 100039 ]
Stopped at      nvme_ctrlr_intx_handler+0x39:   cmpq    $0,0x60(%rdi)
db> bt
Tracing pid 12 tid 100039 td 0xfffff8000422e000
nvme_ctrlr_intx_handler() at nvme_ctrlr_intx_handler+0x39/frame 0xfffffe01bfff59e0
intr_event_execute_handlers() at intr_event_execute_handlers+0xab/frame 0xfffffe01bfff5a20
ithread_loop() at ithread_loop+0x96/frame 0xfffffe01bfff5a70
fork_exit() at fork_exit+0x9a/frame 0xfffffe01bfff5ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe01bfff5ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Inspecting the crash later with kgdb shows that ctrlr->ioq is NULL. (Line numbers won't match exactly, since I was trying suggestions from the freebsd-stable@ thread, but the panic was the same in every case.)

0xffffffff80e16029 in nvme_ctrlr_intx_handler (arg=0xfffffe0000953000) at /usr/src/sys/dev/nvme/nvme_ctrlr.c:819
819		if (ctrlr->ioq[0].cpl)
(kgdb) print ((struct nvme_controller *)arg)->ioq
$1 = (struct nvme_qpair *) 0x0
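
The fault address is consistent with that NULL pointer: the faulting instruction reads at offset 0x60 from %rdi, so a NULL base yields fault virtual address 0x60. A minimal sketch of the arithmetic follows. The struct layout, the padding field, and the names sketch_qpair and cpl_read_address are assumptions made up for illustration; this is not the real struct nvme_qpair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 0x60 bytes of other fields precede cpl. */
struct sketch_qpair {
	unsigned char	 other_fields[0x60];
	void		*cpl;
};

/* Compile-time check that the illustrative layout has the offset we claim. */
static_assert(offsetof(struct sketch_qpair, cpl) == 0x60,
    "cpl must sit at offset 0x60 in this sketch");

/* Address the CPU would touch when reading base[0].cpl. */
static uintptr_t
cpl_read_address(uintptr_t base)
{
	return (base + offsetof(struct sketch_qpair, cpl));
}
```

With base set to 0 (a NULL ioq), cpl_read_address returns 0x60, matching the "fault virtual address" line in the trap output.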
Comment 1 Kubilay Kocak freebsd_committer freebsd_triage 2016-02-23 06:05:59 UTC
Thank you for the report, Andy.

If possible, could you:

* Confirm whether or not the issue is reproducible on a recent 11.0-CURRENT
* Include (as an attachment) another backtrace after the panic if reproducible
Comment 2 Jim Harris freebsd_committer freebsd_triage 2016-02-23 15:46:08 UTC
This is a regression due to r293328.  This will happen on 11-CURRENT as well.

r293328 changed when the controller's ioq array is allocated, so that when we start getting INTx interrupts for the admin queue, ioq has not been allocated yet, which causes this panic.

See attached patch.
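
For illustration only (this is not the attached patch), here is a minimal sketch of the kind of guard that avoids the dereference: the shared INTx handler checks the ioq pointer before indexing into it. All names (sketch_qpair, sketch_ctrlr, sketch_process_completions, sketch_intx_handler) and the simplified structs are hypothetical stand-ins, not the driver's real definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver's structures. */
struct sketch_qpair {
	void	*cpl;		/* completion queue memory, NULL until set up */
	int	 processed;	/* counts calls to the completion routine */
};

struct sketch_ctrlr {
	struct sketch_qpair	 adminq;
	struct sketch_qpair	*ioq;	/* NULL until the I/O queues are allocated */
};

static void
sketch_process_completions(struct sketch_qpair *qpair)
{
	qpair->processed++;
}

/* Shared INTx handler: may run before ioq has been allocated. */
static void
sketch_intx_handler(void *arg)
{
	struct sketch_ctrlr *ctrlr = arg;

	assert(ctrlr != NULL);
	sketch_process_completions(&ctrlr->adminq);

	/* Check the array pointer before indexing into it. */
	if (ctrlr->ioq != NULL && ctrlr->ioq[0].cpl != NULL)
		sketch_process_completions(&ctrlr->ioq[0]);
}
```

In this sketch, an interrupt that arrives while ioq is still NULL only services the admin queue; once ioq points at an initialized queue, the same handler services both.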
Comment 3 Jim Harris freebsd_committer freebsd_triage 2016-02-23 15:46:49 UTC
Created attachment 167329
Patch for bug 207432
Comment 4 Andy Carrel 2016-02-23 16:04:19 UTC
It's probably worth noting that avoiding INTx with 'hw.pci.honor_msi_blacklist=0' in /boot/loader.conf allows things to boot and function normally.
Comment 5 Jim Harris freebsd_committer freebsd_triage 2016-02-23 19:56:43 UTC
Hi Andy,

Are you able to test the attached patch?  I'm pretty sure this fixes your issue but wanted to wait to commit in case you can verify it.

Thanks,

-Jim
Comment 6 Andy Carrel 2016-02-23 21:05:40 UTC
I can confirm that the attached patch does boot successfully in INTx mode. Thanks!
Comment 7 commit-hook freebsd_committer freebsd_triage 2016-02-24 00:01:52 UTC
A commit references this bug:

Author: jimharris
Date: Wed Feb 24 00:01:10 UTC 2016
New revision: 295944
URL: https://svnweb.freebsd.org/changeset/base/295944

Log:
  nvme: fix intx handler to not dereference ioq during initialization

  This was a regression from r293328, which deferred allocation
  of the controller's ioq array until after interrupts are enabled
  during boot.

  PR:		207432
  Reported and tested by: Andy Carrel <wac@google.com>
  MFC after:	3 days
  Sponsored by:	Intel

Changes:
  head/sys/dev/nvme/nvme_ctrlr.c