| Summary | nvme(4) panic on 48 CPU system |
|---|---|
| Product | Base System |
| Reporter | Sean Bruno <sbruno> |
| Component | kern |
| Assignee | freebsd-bugs (Nobody) <bugs> |
| Status | Closed FIXED |
| Severity | Affects Some People |
| CC | bsdimp, cem, imp, jhb |
| Priority | --- |
| Version | CURRENT |
| Hardware | amd64 |
| OS | Any |
Description
Sean Bruno
2018-04-12 15:41:06 UTC

I added a debug panic in nvme_qpair_construct() to check the return of bus_alloc_resource_any() for an IRQ. We seem to be getting NULL back, which I assume means that with 48 cores we are out of IRQs:

```c
	if (ctrlr->msix_enabled) {
		/*
		 * MSI-X vector resource IDs start at 1, so we add one to
		 * the queue's vector to get the corresponding rid to use.
		 */
		qpair->rid = vector + 1;

		qpair->res = bus_alloc_resource_any(ctrlr->dev, SYS_RES_IRQ,
		    &qpair->rid, RF_ACTIVE);
		if (qpair->res == NULL)
			panic("%s: bus_alloc_resource_any for IRQ failed",
			    __func__);

		bus_setup_intr(ctrlr->dev, qpair->res,
		    INTR_TYPE_MISC | INTR_MPSAFE,
		    NULL, nvme_qpair_msix_handler, qpair, &qpair->tag);
	}
```

```
nvd1: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd1: 1144641MB (2344225968 512 byte sectors)
nvd2: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd2: 1144641MB (2344225968 512 byte sectors)
nvd3: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd3: 1144641MB (2344225968 512 byte sectors)
nvd4: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd4: 1144641MB (2344225968 512 byte sectors)
nvd5: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd5: 1144641MB (2344225968 512 byte sectors)
nvd6: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd6: 1144641MB (2344225968 512 byte sectors)
nvd7: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd7: 1144641MB (2344225968 512 byte sectors)
nvd8: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace
nvd8: 1144641MB (2344225968 512 byte sectors)

panic: nvme_qpair_construct: bus_alloc_resource_any for IRQ failed
cpuid = 40
time = 2
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff82188170
vpanic() at vpanic+0x19c/frame 0xffffffff821881f0
panic() at panic+0x43/frame 0xffffffff82188250
nvme_qpair_construct() at nvme_qpair_construct+0x479/frame 0xffffffff821882c0
nvme_ctrlr_start_config_hook() at nvme_ctrlr_start_config_hook+0x1a3/frame 0xffffffff82188310
run_interrupt_driven_config_hooks() at run_interrupt_driven_config_hooks+0x110/frame 0xffffffff82188340
boot_run_interrupt_driven_config_hooks() at boot_run_interrupt_driven_config_hooks+0x22/frame 0xffffffff821883d0
mi_startup() at mi_startup+0x9c/frame 0xffffffff821883f0
btext() at btext+0x2c
Uptime: 2s
```

---

I reduced the number of IRQs required, though I am probably taking a hit on performance, with:

```
hw.nvme.min_cpus_per_ioq="2"
```

At least the system will post now.

---

Please try https://reviews.freebsd.org/P165

---

Well, this panic is actually a different one from the one P165 aims to fix, but the nvme driver is probably using 'mp_ncpus' to determine the number of IRQs to allocate, which isn't going to work out very well. In particular, we don't schedule IRQs on HT threads by default, only on the first thread in a core. This means that if nvme is allocating 48 interrupts per device, it is actually allocating 2 interrupts for each core (e.g. 2 IRQs on CPU 0, 2 on CPU 2, etc.) when not using bus_bind_intr(). If NUMA is enabled, we now also restrict IRQs to cores local to the device.

Rather than using 'mp_ncpus', the driver should use bus_get_cpus() with INTR_CPUS to determine the set of CPUs it should bind interrupts to. It can then use 'CPU_COUNT' on the result in place of 'mp_ncpus'. (It might still be worth testing to see if P165 makes a difference.)

---

Triage: reassign to pool by assignee request.

---

I think this has been fixed, since it works on my system with 96 threads...

---

This has been corrected for some time. We now properly scale for scarce interrupt resources on mega-core machines.
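
---

For illustration, a minimal sketch of the bus_get_cpus()/CPU_COUNT approach suggested above. This is not code from the nvme(4) driver or from this report; the helper name nvme_num_io_queues_hint() is hypothetical.

```c
#include <sys/param.h>
#include <sys/bus.h>
#include <sys/cpuset.h>
#include <sys/smp.h>

/*
 * Hypothetical helper: size the number of I/O queues (and hence MSI-X
 * vectors) from the set of CPUs the bus will actually route interrupts
 * to, instead of from mp_ncpus.
 */
static int
nvme_num_io_queues_hint(device_t dev)
{
	cpuset_t cpus;

	/* Ask the bus which CPUs this device's interrupts may target. */
	if (bus_get_cpus(dev, INTR_CPUS, sizeof(cpus), &cpus) == 0)
		return (CPU_COUNT(&cpus));

	/* Fall back to the old behaviour if the bus has no opinion. */
	return (mp_ncpus);
}
```

On a 48-thread machine with HT and NUMA restrictions in effect, CPU_COUNT() of the INTR_CPUS set can be much smaller than mp_ncpus, which is the mismatch described in the analysis above.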
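As a rough illustration of "scaling for scarce interrupt resources" (again an editor's sketch under assumed names, not the actual nvme(4) change): a driver can request its ideal MSI-X vector count and shrink the I/O queue count to whatever pci_alloc_msix() actually grants, rather than panicking when a later bus_alloc_resource_any() fails.

```c
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/bus.h>
#include <dev/pci/pcivar.h>

/*
 * Hypothetical sketch: ask for one MSI-X vector per desired I/O queue
 * plus one for the admin queue, then scale the I/O queue count down to
 * whatever the platform can actually provide instead of failing outright.
 */
static int
example_alloc_msix(device_t dev, int desired_io_queues, int *io_queues)
{
	int nvec;

	/* Never request more vectors than the device itself exposes. */
	nvec = imin(desired_io_queues + 1, pci_msix_count(dev));
	if (nvec < 2)
		return (ENXIO);		/* caller could fall back to INTx */

	/* pci_alloc_msix() rewrites nvec with the count actually granted. */
	if (pci_alloc_msix(dev, &nvec) != 0 || nvec < 2)
		return (ENXIO);

	*io_queues = nvec - 1;		/* one vector reserved for admin */
	return (0);
}
```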
I added a debug panic in nvme_qpair_construct() to check the return of bus_alloc_resource_any() for an IRQ. We seem to be getting NULL back, which I assume means with 48 cores, we are out of IRQs: if (ctrlr->msix_enabled) { /* * MSI-X vector resource IDs start at 1, so we add one to * the queue's vector to get the corresponding rid to use. */ qpair->rid = vector + 1; qpair->res = bus_alloc_resource_any(ctrlr->dev, SYS_RES_IRQ, &qpair->rid, RF_ACTIVE); if (qpair->res == NULL) panic("%s: bus_alloc_resource_any for IRQ failed", __func__); bus_setup_intr(ctrlr->dev, qpair->res, INTR_TYPE_MISC | INTR_MPSAFE, NULL, nvme_qpair_msix_handler, qpair, &qpair->tag); } nvd1: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace nvd1: 1144641MB (2344225968 512 byte sectors) nvd2: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace nvd2: 1144641MB (2344225968 512 byte sectors) nvd3: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace nvd3: 1144641MB (2344225968 512 byte sectors) nvd4: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace nvd4: 1144641MB (2344225968 512 byte sectors) nvd5: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace nvd5: 1144641MB (2344225968 512 byte sectors) nvd6: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace nvd6: 1144641MB (2344225968 512 byte sectors) nvd7: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace nvd7: 1144641MB (2344225968 512 byte sectors) nvd8: <MTFDHAL1T2MCF-1AN1ZABYY> NVMe namespace nvd8: 1144641MB (2344225968 512 byte sectors) panic: nvme_qpair_construct: bus_alloc_resource_any for IRQ failed cpuid = 40 time = 2 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff82188170 vpanic() at vpanic+0x19c/frame 0xffffffff821881f0 panic() at panic+0x43/frame 0xffffffff82188250 nvme_qpair_construct() at nvme_qpair_construct+0x479/frame 0xffffffff821882c0 nvme_ctrlr_start_config_hook() at nvme_ctrlr_start_config_hook+0x1a3/frame 0xffffffff82188310 run_interrupt_driven_config_hooks() at run_interrupt_driven_config_hooks+0x110/frame 0xffffffff82188340 boot_run_interrupt_driven_config_hooks() at boot_run_interrupt_driven_config_hooks+0x22/frame 0xffffffff821883d0 mi_startup() at mi_startup+0x9c/frame 0xffffffff821883f0 btext() at btext+0x2c Uptime: 2s I reduced the number of IRQs required but probably am taking a hit on performance with: hw.nvme.min_cpus_per_ioq="2" At least the system will post now. Please try https://reviews.freebsd.org/P165 Well, this panic is actually a different one from the one P165 aims to fix, but the nvme driver is probably using 'mp_ncpus' to determine the number of IRQs to allocate which isn't going to work out very well. In particular, we don't schedule IRQs on HT threads by default, but only on the first thread in a core. This means if nvme is allocating 48 interrupts per device, it is actually allocating 2 interrupts for each core (e.g. 2 IRQs on CPU 0, 2 on CPU2, etc.) when not using bus_bind_intr(). If you have NUMA enabled, then we now also restrict IRQs to cores local to the device. Rather than using 'mp_ncpus', the driver should use bus_get_cpus() with INTR_CPUS to determine the set of CPUs it should bind interrupts to. It can then use 'CPU_COUNT' on the result in place of 'mp_ncpus'. (It might still be worth testing to see if P165 makes a difference.) ^Triage: reassign to pool by assignee request. I think this has been fixed, since it works on my system with 96 threads... This has been corrected for some time. We now properly scale for scarce interrupt resources on mega-core machines. |