Bug 227472 (irq, nvme) - nvme(4) panic on 48 CPU system
Summary: nvme(4) panic on 48 CPU system
Status: New
Alias: irq, nvme
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Sean Bruno
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-04-12 15:41 UTC by Sean Bruno
Modified: 2018-04-13 18:45 UTC
Description Sean Bruno freebsd_committer 2018-04-12 15:41:06 UTC
Booting a Supermicro system with Hyper-Threading enabled (48 logical CPUs: 24 cores x 2 hardware threads) panics in nvme(4).  With Hyper-Threading disabled (24 CPUs), the system boots successfully.


Booting...
KDB: debugger backends: ddb
KDB: current backend: ddb
Copyright (c) 1992-2017 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 12.0-CURRENT-LLBSD-2017.3.0 #78: Wed Nov  1 04:14:16 UTC 2017
    root@syssw04.phx3.llnw.net:/usr/obj/usr/src/sys/SIXFOUR amd64
FreeBSD clang version 5.0.0 (tags/RELEASE_500/final 312559) (based on LLVM 5.0.0svn)
VT(vga): resolution 640x480
CPU: Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz (2100.06-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x50654  Family=0x6  Model=0x55  Stepping=4
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7ffefbff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x121<LAHF,ABM,Prefetch>
  Structured Extended Features=0xd39ffffb<FSGSBASE,TSCADJ,BMI1,HLE,AVX2,FDPEXC,SMEP,BMI2,ERMS,INVPCID,RTM,PQM,NFPUSG,MPX,PQE,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,PROCTRACE,AVX512CD,AVX512BW>
  Structured Extended Features2=0x8<PKU>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr
  TSC: P-state invariant, performance statistics
real memory  = 206152138752 (196602 MB)
avail memory = 198991896576 (189773 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <SUPERM SMCI--MB>
FreeBSD/SMP: Multiprocessor System Detected: 48 CPUs
FreeBSD/SMP: 1 package(s) x 24 core(s) x 2 hardware threads
arc4random: no preloaded entropy cache
ioapic0 <Version 2.0> irqs 0-23 on motherboard
ioapic1 <Version 2.0> irqs 24-31 on motherboard
ioapic2 <Version 2.0> irqs 32-39 on motherboard
ioapic3 <Version 2.0> irqs 40-47 on motherboard
ioapic4 <Version 2.0> irqs 48-55 on motherboard
SMP: AP CPU #1 Launched!
SMP: AP CPU #17 Launched!
SMP: AP CPU #44 Launched!
SMP: AP CPU #32 Launched!
SMP: AP CPU #8 Launched!
SMP: AP CPU #36 Launched!
SMP: AP CPU #24 Launched!
SMP: AP CPU #28 Launched!
SMP: AP CPU #46 Launched!
SMP: AP CPU #30 Launched!
SMP: AP CPU #4 Launched!
SMP: AP CPU #25 Launched!
SMP: AP CPU #2 Launched!
SMP: AP CPU #33 Launched!
SMP: AP CPU #37 Launched!
SMP: AP CPU #3 Launched!
SMP: AP CPU #15 Launched!
SMP: AP CPU #22 Launched!
SMP: AP CPU #38 Launched!
SMP: AP CPU #18 Launched!
SMP: AP CPU #31 Launched!
SMP: AP CPU #11 Launched!
SMP: AP CPU #47 Launched!
SMP: AP CPU #12 Launched!
SMP: AP CPU #23 Launched!
SMP: AP CPU #6 Launched!
SMP: AP CPU #14 Launched!
SMP: AP CPU #26 Launched!
SMP: AP CPU #45 Launched!
SMP: AP CPU #34 Launched!
SMP: AP CPU #13 Launched!
SMP: AP CPU #19 Launched!
SMP: AP CPU #39 Launched!
SMP: AP CPU #20 Launched!
SMP: AP CPU #16 Launched!
SMP: AP CPU #21 Launched!
SMP: AP CPU #9 Launched!
SMP: AP CPU #43 Launched!
SMP: AP CPU #42 Launched!
SMP: AP CPU #41 Launched!
SMP: AP CPU #40 Launched!
SMP: AP CPU #27 Launched!
SMP: AP CPU #29 Launched!
SMP: AP CPU #5 Launched!
SMP: AP CPU #35 Launched!
SMP: AP CPU #10 Launched!
SMP: AP CPU #7 Launched!
Timecounter "TSC" frequency 2100056350 Hz quality 1000
random: entropy device external interface
netmap: loaded module
module_register_init: MOD_LOAD (vesa, 0xffffffff80dd36d0, 0) error 19
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
kbd1 at kbdmux0
nexus0
vtvga0: <VT VGA driver> on motherboard
cryptosoft0: <software crypto> on motherboard
acpi0: <SUPERM SUPERM> on motherboard
acpi0: Power Button (fixed)
ACPI Error: Method parse/execution failed \_SB.PC00.IOTR._CRS, AE_AML_NO_RESOURCE_END_TAG (20170531/psparse-677)
ACPI Error: Method execution failed \_SB.PC00.IOTR._CRS, AE_AML_NO_RESOURCE_END_TAG (20170531/uteval-219)
can't fetch resources for \_SB_.PC00.IOTR - AE_AML_NO_RESOURCE_END_TAG
cpu0: <ACPI CPU> numa-domain 0 on acpi0
cpu1: <ACPI CPU> numa-domain 0 on acpi0
cpu2: <ACPI CPU> numa-domain 0 on acpi0
cpu3: <ACPI CPU> numa-domain 0 on acpi0
cpu4: <ACPI CPU> numa-domain 0 on acpi0
cpu5: <ACPI CPU> numa-domain 0 on acpi0
cpu6: <ACPI CPU> numa-domain 0 on acpi0
cpu7: <ACPI CPU> numa-domain 0 on acpi0
cpu8: <ACPI CPU> numa-domain 0 on acpi0
cpu9: <ACPI CPU> numa-domain 0 on acpi0
cpu10: <ACPI CPU> numa-domain 0 on acpi0
cpu11: <ACPI CPU> numa-domain 0 on acpi0
cpu12: <ACPI CPU> numa-domain 0 on acpi0
cpu13: <ACPI CPU> numa-domain 0 on acpi0
cpu14: <ACPI CPU> numa-domain 0 on acpi0
cpu15: <ACPI CPU> numa-domain 0 on acpi0
cpu16: <ACPI CPU> numa-domain 0 on acpi0
cpu17: <ACPI CPU> numa-domain 0 on acpi0
cpu18: <ACPI CPU> numa-domain 0 on acpi0
cpu19: <ACPI CPU> numa-domain 0 on acpi0
cpu20: <ACPI CPU> numa-domain 0 on acpi0
cpu21: <ACPI CPU> numa-domain 0 on acpi0
cpu22: <ACPI CPU> numa-domain 0 on acpi0
cpu23: <ACPI CPU> numa-domain 0 on acpi0
cpu24: <ACPI CPU> numa-domain 0 on acpi0
cpu25: <ACPI CPU> numa-domain 0 on acpi0
cpu26: <ACPI CPU> numa-domain 0 on acpi0
cpu27: <ACPI CPU> numa-domain 0 on acpi0
cpu28: <ACPI CPU> numa-domain 0 on acpi0
cpu29: <ACPI CPU> numa-domain 0 on acpi0
cpu30: <ACPI CPU> numa-domain 0 on acpi0
cpu31: <ACPI CPU> numa-domain 0 on acpi0
cpu32: <ACPI CPU> numa-domain 0 on acpi0
cpu33: <ACPI CPU> numa-domain 0 on acpi0
cpu34: <ACPI CPU> numa-domain 0 on acpi0
cpu35: <ACPI CPU> numa-domain 0 on acpi0
cpu36: <ACPI CPU> numa-domain 0 on acpi0
cpu37: <ACPI CPU> numa-domain 0 on acpi0
cpu38: <ACPI CPU> numa-domain 0 on acpi0
cpu39: <ACPI CPU> numa-domain 0 on acpi0
cpu40: <ACPI CPU> numa-domain 0 on acpi0
cpu41: <ACPI CPU> numa-domain 0 on acpi0
cpu42: <ACPI CPU> numa-domain 0 on acpi0
cpu43: <ACPI CPU> numa-domain 0 on acpi0
cpu44: <ACPI CPU> numa-domain 0 on acpi0
cpu45: <ACPI CPU> numa-domain 0 on acpi0
cpu46: <ACPI CPU> numa-domain 0 on acpi0
cpu47: <ACPI CPU> numa-domain 0 on acpi0
atrtc0: <AT realtime clock> port 0x70-0x71,0x74-0x77 irq 8 on acpi0
atrtc0: registered as a time-of-day clock, resolution 1.000000s
Event timer "RTC" frequency 32768 Hz quality 0
attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 24000000 Hz quality 950
Event timer "HPET" frequency 24000000 Hz quality 350
Event timer "HPET1" frequency 24000000 Hz quality 340
Event timer "HPET2" frequency 24000000 Hz quality 340
Event timer "HPET3" frequency 24000000 Hz quality 340
Event timer "HPET4" frequency 24000000 Hz quality 340
Event timer "HPET5" frequency 24000000 Hz quality 340
Event timer "HPET6" frequency 24000000 Hz quality 340
Event timer "HPET7" frequency 24000000 Hz quality 340
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x508-0x50b on acpi0
acpi_syscontainer0: <System Container> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff numa-domain 0 on acpi0
pci0: <ACPI PCI bus> numa-domain 0 on pcib0
pci0: <dasp, performance counters> at device 8.1 (no driver attached)
pci0: <unknown> at device 17.0 (no driver attached)
pci0: <unknown> at device 17.1 (no driver attached)
ahci0: <AHCI SATA controller> port 0x3030-0x3037,0x3020-0x3023,0x3000-0x301f mem 0xaa184000-0xaa185fff,0xaa186000-0xaa1860ff,0xaa100000-0xaa17ffff irq 16 at device 17.5 numa-domain 0 on pci0
ahci0: AHCI v1.31 with 3 6Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahciem0: <AHCI enclosure management bridge> on ahci0
xhci0: <XHCI (generic) USB 3.0 controller> mem 0x383ffff00000-0x383ffff0ffff irq 16 at device 20.0 numa-domain 0 on pci0
xhci0: 32 bytes context size, 64-bit DMA
usbus0: waiting for BIOS to give up control
xhci_interrupt: host controller halted
usbus0 numa-domain 0 on xhci0
usbus0: 5.0Gbps Super Speed USB v3.0
pci0: <simple comms> at device 22.0 (no driver attached)
pci0: <simple comms> at device 22.1 (no driver attached)
pci0: <simple comms> at device 22.4 (no driver attached)
pcib1: <ACPI PCI-PCI bridge> irq 16 at device 28.0 numa-domain 0 on pci0
pci1: <ACPI PCI bus> numa-domain 0 on pcib1
pcib2: <ACPI PCI-PCI bridge> irq 17 at device 28.5 numa-domain 0 on pci0
pci2: <ACPI PCI bus> numa-domain 0 on pcib2
pcib3: <ACPI PCI-PCI bridge> irq 17 at device 0.0 numa-domain 0 on pci2
pci3: <ACPI PCI bus> numa-domain 0 on pcib3
vgapci0: <VGA-compatible display> port 0x2000-0x207f mem 0xa9000000-0xa9ffffff,0xaa000000-0xaa01ffff irq 17 at device 0.0 numa-domain 0 on pci3
vgapci0: Boot video device
isab0: <PCI-ISA bridge> at device 31.0 numa-domain 0 on pci0
isa0: <ISA bus> numa-domain 0 on isab0
pci0: <memory> at device 31.2 (no driver attached)
pci0: <serial bus> at device 31.5 (no driver attached)
pcib4: <ACPI Host-PCI bridge> numa-domain 0 on acpi0
pci4: <ACPI PCI bus> numa-domain 0 on pcib4
pcib5: <ACPI PCI-PCI bridge> irq 39 at device 0.0 numa-domain 0 on pci4
pci5: <ACPI PCI bus> numa-domain 0 on pcib5
t6nex0: <Chelsio T6225-CR> mem 0xc5300000-0xc537ffff,0xc4000000-0xc4ffffff,0xc5984000-0xc5985fff irq 32 at device 0.4 numa-domain 0 on pci5
cc0: <port 0> numa-domain 0 on t6nex0
cc0: Ethernet address: 00:07:43:3f:0b:c0
cc0: 16 txq, 8 rxq (NIC)
cc1: <port 1> numa-domain 0 on t6nex0
cc1: Ethernet address: 00:07:43:3f:0b:c8
cc1: 16 txq, 8 rxq (NIC)
t6nex0: PCIe gen3 x8, 2 ports, 18 MSI-X interrupts, 51 eq, 17 iq
pci5: <mass storage, SCSI> at device 0.5 (no driver attached)
pci5: <serial bus, Fibre Channel> at device 0.6 (no driver attached)
pcib6: <ACPI PCI-PCI bridge> irq 39 at device 2.0 numa-domain 0 on pci4
pci6: <ACPI PCI bus> numa-domain 0 on pcib6
pcib7: <ACPI PCI-PCI bridge> mem 0xc5c00000-0xc5c1ffff irq 34 at device 0.0 numa-domain 0 on pci6
pci7: <ACPI PCI bus> numa-domain 0 on pcib7
pcib8: <ACPI PCI-PCI bridge> irq 34 at device 3.0 numa-domain 0 on pci7
pci8: <ACPI PCI bus> numa-domain 0 on pcib8
ixl0: <Intel(R) Ethernet Connection XL710/X722 Driver, Version - 1.7.12-k> mem 0xc2000000-0xc2ffffff,0xc3008000-0xc300ffff irq 36 at device 0.0 numa-domain 0 on pci8
ixl0: Using MSIX interrupts with 9 vectors
ixl0: fw 3.1.49755 api 1.5 nvm 3.1d etid 80000827 oem 1.262.0
ixl0: PF-ID[0]: VFs 32, MSIX 129, VF MSIX 5, QPs 768, MDIO shared
ixl0: Allocating 8 queues for PF LAN VSI; 8 queues active
ixl0: Ethernet address: ac:1f:6b:0a:12:da
ixl0: Link is up, 1 Gbps Full Duplex, FEC: None, Autoneg: True, Flow Control: None
ixl0: link state changed to UP
pci12: <PCI bus> numa-domain 0 on pcib12
nvme0: <Generic NVMe Device> mem 0xe0e10000-0xe0e13fff irq 40 at device 0.0 numa-domain 0 on pci12
pcib13: <PCI-PCI bridge> at device 1.0 numa-domain 0 on pci11
pci13: <PCI bus> numa-domain 0 on pcib13
nvme1: <Generic NVMe Device> mem 0xe0d10000-0xe0d13fff irq 44 at device 0.0 numa-domain 0 on pci13
pcib14: <PCI-PCI bridge> at device 2.0 numa-domain 0 on pci11
pci14: <PCI bus> numa-domain 0 on pcib14
nvme2: <Generic NVMe Device> mem 0xe0c10000-0xe0c13fff irq 45 at device 0.0 numa-domain 0 on pci14
pcib15: <PCI-PCI bridge> at device 3.0 numa-domain 0 on pci11
pci15: <PCI bus> numa-domain 0 on pcib15
nvme3: <Generic NVMe Device> mem 0xe0b10000-0xe0b13fff irq 46 at device 0.0 numa-domain 0 on pci15
pcib16: <PCI-PCI bridge> at device 4.0 numa-domain 0 on pci11
pci16: <PCI bus> numa-domain 0 on pcib16
nvme4: <Generic NVMe Device> mem 0xe0a10000-0xe0a13fff irq 40 at device 0.0 numa-domain 0 on pci16
pcib17: <PCI-PCI bridge> at device 5.0 numa-domain 0 on pci11
pci17: <PCI bus> numa-domain 0 on pcib17
pcib18: <PCI-PCI bridge> at device 6.0 numa-domain 0 on pci11
pci18: <PCI bus> numa-domain 0 on pcib18
pcib19: <PCI-PCI bridge> at device 7.0 numa-domain 0 on pci11
pci19: <PCI bus> numa-domain 0 on pcib19
pcib20: <ACPI Host-PCI bridge> numa-domain 0 on acpi0
pci20: <ACPI PCI bus> numa-domain 0 on pcib20
pcib21: <ACPI PCI-PCI bridge> irq 55 at device 0.0 numa-domain 0 on pci20
pci21: <ACPI PCI bus> numa-domain 0 on pcib21
pcib22: <PCI-PCI bridge> at device 0.0 numa-domain 0 on pci21
pci22: <PCI bus> numa-domain 0 on pcib22
pcib23: <PCI-PCI bridge> at device 0.0 numa-domain 0 on pci22
pci23: <PCI bus> numa-domain 0 on pcib23
nvme5: <Generic NVMe Device> mem 0xfbe10000-0xfbe13fff irq 48 at device 0.0 numa-domain 0 on pci23
pcib24: <PCI-PCI bridge> at device 1.0 numa-domain 0 on pci22
pci24: <PCI bus> numa-domain 0 on pcib24
nvme6: <Generic NVMe Device> mem 0xfbd10000-0xfbd13fff irq 52 at device 0.0 numa-domain 0 on pci24
pcib25: <PCI-PCI bridge> at device 2.0 numa-domain 0 on pci22
pci25: <PCI bus> numa-domain 0 on pcib25
nvme7: <Generic NVMe Device> mem 0xfbc10000-0xfbc13fff irq 53 at device 0.0 numa-domain 0 on pci25
pcib26: <PCI-PCI bridge> at device 3.0 numa-domain 0 on pci22
pci26: <PCI bus> numa-domain 0 on pcib26
nvme8: <Generic NVMe Device> mem 0xfbb10000-0xfbb13fff irq 54 at device 0.0 numa-domain 0 on pci26
pcib27: <PCI-PCI bridge> at device 4.0 numa-domain 0 on pci22
pci27: <PCI bus> numa-domain 0 on pcib27
nvme9: <Generic NVMe Device> mem 0xfba10000-0xfba13fff irq 48 at device 0.0 numa-domain 0 on pci27
pcib28: <PCI-PCI bridge> at device 5.0 numa-domain 0 on pci22
pci28: <PCI bus> numa-domain 0 on pcib28
pcib29: <PCI-PCI bridge> at device 6.0 numa-domain 0 on pci22
pci29: <PCI bus> numa-domain 0 on pcib29
pcib30: <PCI-PCI bridge> at device 7.0 numa-domain 0 on pci22
pci30: <PCI bus> numa-domain 0 on pcib30
pci20: <dasp, performance counters> at device 14.0 (no driver attached)
pci20: <dasp, performance counters> at device 15.0 (no driver attached)
pci20: <dasp, performance counters> at device 16.0 (no driver attached)
pci20: <dasp, performance counters> at device 18.0 (no driver attached)
pci20: <dasp, performance counters> at device 18.1 (no driver attached)
pci20: <dasp, performance counters> at device 18.4 (no driver attached)
pci20: <dasp, performance counters> at device 18.5 (no driver attached)
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 on acpi0
uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 flags 0x10 on acpi0
uart1: console (115200,n,8,1)
ipmi0: <IPMI System Interface> port 0xca2,0xca3 on acpi0
ipmi0: KCS mode found at io 0xca2 on acpi
orm0: <ISA Option ROM> at iomem 0xc0000-0xc7fff on isa0
ppc0: cannot reserve I/O port range
coretemp0: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu0
est0: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu0
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est0 attach returned 6
coretemp1: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu1
est1: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu1
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est1 attach returned 6
coretemp2: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu2
est2: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu2
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est2 attach returned 6
coretemp3: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu3
est3: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu3
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est3 attach returned 6
coretemp4: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu4
est4: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu4
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est4 attach returned 6
coretemp5: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu5
est5: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu5
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est5 attach returned 6
coretemp6: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu6
est6: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu6
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est6 attach returned 6
coretemp7: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu7
est7: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu7
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est7 attach returned 6
coretemp8: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu8
est8: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu8
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est8 attach returned 6
coretemp9: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu9
est9: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu9
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est9 attach returned 6
coretemp10: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu10
est10: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu10
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est10 attach returned 6
coretemp11: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu11
est11: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu11
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19ef00001500
device_attach: est11 attach returned 6
coretemp12: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu12
est12: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu12
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19ef00001500
device_attach: est12 attach returned 6
coretemp13: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu13
est13: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu13
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est13 attach returned 6
coretemp14: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu14
est14: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu14
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est14 attach returned 6
coretemp15: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu15
est15: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu15
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19ef00001500
device_attach: est15 attach returned 6
coretemp16: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu16
est16: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu16
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19ef00001500
device_attach: est16 attach returned 6
coretemp17: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu17
est17: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu17
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19ef00001500
device_attach: est17 attach returned 6
coretemp18: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu18
est18: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu18
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19ef00001500
device_attach: est18 attach returned 6
coretemp19: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu19
est19: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu19
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est19 attach returned 6
coretemp20: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu20
est20: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu20
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est20 attach returned 6
coretemp21: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu21
est21: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu21
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est21 attach returned 6
coretemp22: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu22
est22: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu22
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est22 attach returned 6
coretemp23: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu23
est23: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu23
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19ef00001500
device_attach: est23 attach returned 6
coretemp24: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu24
est24: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu24
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est24 attach returned 6
coretemp25: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu25
est25: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu25
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est25 attach returned 6
coretemp26: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu26
est26: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu26
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19ef00001500
device_attach: est26 attach returned 6
coretemp27: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu27
est27: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu27
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est27 attach returned 6
coretemp28: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu28
est28: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu28
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est28 attach returned 6
coretemp29: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu29
est29: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu29
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est29 attach returned 6
coretemp30: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu30
est30: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu30
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est30 attach returned 6
coretemp31: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu31
est31: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu31
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est31 attach returned 6
coretemp32: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu32
est32: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu32
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est32 attach returned 6
coretemp33: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu33
est33: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu33
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est33 attach returned 6
coretemp34: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu34
est34: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu34
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est34 attach returned 6
coretemp35: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu35
est35: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu35
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19ef00001500
device_attach: est35 attach returned 6
coretemp36: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu36
est36: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu36
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est36 attach returned 6
coretemp37: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu37
est37: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu37
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est37 attach returned 6
coretemp38: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu38
est38: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu38
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est38 attach returned 6
coretemp39: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu39
est39: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu39
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est39 attach returned 6
coretemp40: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu40
est40: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu40
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est40 attach returned 6
coretemp41: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu41
est41: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu41
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est41 attach returned 6
coretemp42: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu42
est42: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu42
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est42 attach returned 6
coretemp43: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu43
est43: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu43
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est43 attach returned 6
coretemp44: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu44
est44: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu44
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est44 attach returned 6
coretemp45: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu45
est45: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu45
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est45 attach returned 6
coretemp46: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu46
est46: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu46
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est46 attach returned 6
coretemp47: <CPU On-Die Thermal Sensors> numa-domain 0 on cpu47
est47: <Enhanced SpeedStep Frequency Control> numa-domain 0 on cpu47
est: CPU supports Enhanced Speedstep, but is not recognized.
est: cpu_vendor GenuineIntel, msr 19e800001500
device_attach: est47 attach returned 6
Timecounters tick every 1.000 msec
arc4random: no preloaded entropy cache
Attempting to load tcp_bbr
tcp_bbr is now available
ipfw2 (+ipv6) initialized, divert loadable, nat loadable, default to accept, logging disabled
ugen0.1: <0x8086 XHCI root HUB> at usbus0
uhub0: <0x8086 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
nvd0: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd0: 1144641MB (2344225968 512 byte sectors)
ada0 at ahcich2 bus 0 scbus2 target 0 lun 0
ada0: <Micron M600 MTFDDAV128MBF MU04> ACS-3 ATA SATA 3.x device
ada0: Serial Number 16051323B113
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 122104MB (250069680 512 byte sectors)
ses0 at ahciem0 bus 0 scbus3 target 0 lun 0
ses0: <AHCI SGPIO Enclosure 1.00 0001> SEMB S-E-S 2.00 device
ses0: SEMB SES Device
nvd1: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd1: 1144641MB (2344225968 512 byte sectors)
nvd2: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd3: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd3: 1144641MB (2344225968 512 byte sectors)
nvd4: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd4: 1144641MB (2344225968 512 byte sectors)
nvd5: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd5: 1144641MB (2344225968 512 byte sectors)
nvd6: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd6: 1144641MB (2344225968 512 byte sectors)
nvd7: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd7: 1144641MB (2344225968 512 byte sectors)
nvd8: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd8: 1144641MB (2344225968 512 byte sectors)
panic: nexus_setup_intr: NULL irq resource!
cpuid = 40
time = 2
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff82187e30
vpanic() at vpanic+0x19c/frame 0xffffffff82187eb0
panic() at panic+0x43/frame 0xffffffff82187f10
nexus_setup_intr() at nexus_setup_intr+0xa7/frame 0xffffffff82187f60
pci_setup_intr() at pci_setup_intr+0x27/frame 0xffffffff82188000
pci_setup_intr() at pci_setup_intr+0x27/frame 0xffffffff821880a0
pci_setup_intr() at pci_setup_intr+0x27/frame 0xffffffff82188140
pci_setup_intr() at pci_setup_intr+0x27/frame 0xffffffff821881e0
bus_setup_intr() at bus_setup_intr+0xa1/frame 0xffffffff82188240
nvme_qpair_construct() at nvme_qpair_construct+0x95/frame 0xffffffff821882c0
nvme_ctrlr_start_config_hook() at nvme_ctrlr_start_config_hook+0x1a3/frame 0xffffffff82188310
run_interrupt_driven_config_hooks() at run_interrupt_driven_config_hooks+0x110/frame 0xffffffff82188340
boot_run_interrupt_driven_config_hooks() at boot_run_interrupt_driven_config_hooks+0x22/frame 0xffffffff821883d0
mi_startup() at mi_startup+0x9c/frame 0xffffffff821883f0
btext() at btext+0x2c
Comment 1 Sean Bruno freebsd_committer 2018-04-12 16:16:48 UTC
I added a debug panic in nvme_qpair_construct() to check the return value of bus_alloc_resource_any() for the IRQ.  We are getting NULL back, which I assume means that with 48 CPUs we are running out of IRQs:

        if (ctrlr->msix_enabled) {

                /*
                 * MSI-X vector resource IDs start at 1, so we add one to
                 *  the queue's vector to get the corresponding rid to use.  
                 */
                qpair->rid = vector + 1;

                qpair->res = bus_alloc_resource_any(ctrlr->dev, SYS_RES_IRQ,
                    &qpair->rid, RF_ACTIVE);
                if (qpair->res == NULL)
                        panic("%s: bus_alloc_resource_any for IRQ failed", __func__);
                bus_setup_intr(ctrlr->dev, qpair->res,
                    INTR_TYPE_MISC | INTR_MPSAFE, NULL,
                    nvme_qpair_msix_handler, qpair, &qpair->tag);
        }





nvd1: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd1: 1144641MB (2344225968 512 byte sectors)
nvd2: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd2: 1144641MB (2344225968 512 byte sectors)
nvd3: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd3: 1144641MB (2344225968 512 byte sectors)
nvd4: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd4: 1144641MB (2344225968 512 byte sectors)
nvd5: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd5: 1144641MB (2344225968 512 byte sectors)
nvd6: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd6: 1144641MB (2344225968 512 byte sectors)
nvd7: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd7: 1144641MB (2344225968 512 byte sectors)
nvd8: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd8: 1144641MB (2344225968 512 byte sectors)
panic: nvme_qpair_construct: bus_alloc_resource_any for IRQ failed
cpuid = 40
time = 2
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff82188170
vpanic() at vpanic+0x19c/frame 0xffffffff821881f0
panic() at panic+0x43/frame 0xffffffff82188250
nvme_qpair_construct() at nvme_qpair_construct+0x479/frame 0xffffffff821882c0
nvme_ctrlr_start_config_hook() at nvme_ctrlr_start_config_hook+0x1a3/frame 0xffffffff82188310
run_interrupt_driven_config_hooks() at run_interrupt_driven_config_hooks+0x110/frame 0xffffffff82188340
boot_run_interrupt_driven_config_hooks() at boot_run_interrupt_driven_config_hooks+0x22/frame 0xffffffff821883d0
mi_startup() at mi_startup+0x9c/frame 0xffffffff821883f0
btext() at btext+0x2c
Uptime: 2s
Comment 2 Sean Bruno freebsd_committer 2018-04-12 16:31:59 UTC
I reduced the number of IRQs required, though probably at some cost in performance, by setting this loader tunable:

hw.nvme.min_cpus_per_ioq="2"

At least the system boots now.
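
A rough sketch of why the tunable helps, under assumptions that this report does not confirm: each controller requests one MSI-X vector per I/O queue plus one for the admin queue, and the number of I/O queues is the CPU count divided by hw.nvme.min_cpus_per_ioq, rounded up. The helper names are hypothetical and this is not the driver's code. The controller count of 10 comes from nvme0 through nvme9 in the boot log.

```c
#include <assert.h>

/*
 * Hypothetical helper (not the driver's actual code): estimate how many
 * MSI-X vectors one controller asks for, assuming one I/O queue per
 * `cpus_per_ioq` CPUs plus one vector for the admin queue.
 */
static int
vectors_per_controller(int ncpus, int cpus_per_ioq)
{
	assert(ncpus > 0 && cpus_per_ioq > 0);
	if (cpus_per_ioq > ncpus)
		cpus_per_ioq = ncpus;
	/* Round up so every CPU is covered by some queue (assumption). */
	int io_queues = (ncpus + cpus_per_ioq - 1) / cpus_per_ioq;
	return (io_queues + 1);
}

/* Total vectors requested by `ncontrollers` identical controllers. */
static int
total_vectors(int ncontrollers, int ncpus, int cpus_per_ioq)
{
	return (ncontrollers * vectors_per_controller(ncpus, cpus_per_ioq));
}
```

Under these assumptions, total_vectors(10, 48, 1) is 490 and total_vectors(10, 48, 2) is 250. The latter equals total_vectors(10, 24, 1), the Hyper-Threading-disabled configuration that boots.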
Comment 3 John Baldwin freebsd_committer freebsd_triage 2018-04-13 18:29:29 UTC
Please try https://reviews.freebsd.org/P165
Comment 4 John Baldwin freebsd_committer freebsd_triage 2018-04-13 18:45:15 UTC
Well, this panic is actually different from the one P165 aims to fix, but the nvme driver is probably using 'mp_ncpus' to determine the number of IRQs to allocate, which isn't going to work out very well.  In particular, by default we don't schedule IRQs on HT threads, only on the first thread in each core.  This means that if nvme is allocating 48 interrupts per device and not using bus_bind_intr(), it is actually placing 2 interrupts on each core (e.g. 2 IRQs on CPU 0, 2 on CPU 2, etc.).  If you have NUMA enabled, we now also restrict IRQs to cores local to the device.

Rather than using 'mp_ncpus', the driver should use bus_get_cpus() with INTR_CPUS to determine the set of CPUs it should bind interrupts to.  It can then use 'CPU_COUNT' on the result in place of 'mp_ncpus'.

(It might still be worth testing to see if P165 makes a difference.)
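
To illustrate the arithmetic in comment 4, here is a simplified userland model, not kernel code. It assumes that the threads of a core have adjacent logical CPU IDs, that only the first thread of each core is eligible for interrupts, and that vectors are spread round-robin over the eligible CPUs. NUMA locality is not modeled. The function names are hypothetical; in the driver, the eligible set would come from bus_get_cpus() with INTR_CPUS and its size from CPU_COUNT, as suggested above.

```c
#include <assert.h>

/*
 * Userland model only: count the CPUs eligible for interrupts, approximated
 * as the first hardware thread of each core (assumes adjacent thread IDs).
 */
static int
intr_cpu_count(int ncpus, int threads_per_core)
{
	assert(ncpus > 0 && threads_per_core > 0);
	int count = 0;
	for (int cpu = 0; cpu < ncpus; cpu++)
		if (cpu % threads_per_core == 0)
			count++;
	return (count);
}

/*
 * Spread `nvectors` round-robin over the eligible CPUs and record how many
 * land on each logical CPU in per_cpu[] (caller supplies ncpus entries).
 */
static void
spread_vectors(int nvectors, int ncpus, int threads_per_core, int *per_cpu)
{
	for (int cpu = 0; cpu < ncpus; cpu++)
		per_cpu[cpu] = 0;
	int eligible = intr_cpu_count(ncpus, threads_per_core);
	for (int v = 0; v < nvectors; v++) {
		int core = v % eligible;
		per_cpu[core * threads_per_core]++;
	}
}
```

With 48 logical CPUs and 2 threads per core, the model yields 24 eligible CPUs. Spreading 48 vectors (one per 'mp_ncpus') puts 2 on each of CPU 0, CPU 2, and so on, and none on the odd-numbered threads, while sizing the request by the eligible count gives 1 per core.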