System Configuration:
  2x 18C E5-v3 CPU (SMT enabled)
  2x X520 Intel Ethernet 10Gb Adapter (4 ports total)
  11.0-CURRENT (r281283)

The ixgbe driver tries to allocate the full range of 64 MSI-X vectors per port. The first two ports succeed, but the third and fourth ports fail and fall back to a single MSI vector.

msix_alloc() tries to use intr_next_cpu() to round-robin MSI-X vectors across all cores. But driver initialization happens before the APs are started, so intr_next_cpu() just returns the BSP's APIC ID 0. Each local APIC is limited to APIC_NUM_IOINTS==191 vectors, so a single local APIC cannot handle the 64 vectors for each of the 4 ports (256 in total).

intr_shuffle_irqs runs in the SI_SUB_SMP phase and is intended to shuffle IRQs off of core 0 to the APs, but at this point it is too late: pci_alloc_msix() has already reported to ixgbe(4) that it could not allocate all 64 vectors.

x86 defines NUM_MSI_INTS as 512, but this bug effectively restricts the number of MSI-X vectors to APIC_NUM_IOINTS (191).

The impact is that some drivers/devices which can take advantage of multiple queues and one interrupt per queue must fall back to a single interrupt. There may also be inconsistency across platforms based on device enumeration: whichever devices enumerate first will get their allotment of MSI-X vectors.

This problem will get worse as NVMe SSDs become more prevalent, since they also try to allocate 32-64 MSI-X vectors each. There are systems available already with 8 NVMe SSDs in a single system.
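The arithmetic behind the failure can be sketched with a small userland model. This is only an illustration, not kernel code: alloc_port_on_bsp() and simulate_boot() are hypothetical helpers, the model assumes the boot CPU starts with no vectors in use (a real system would already have some allocated), and the only figures taken from the report are the 191-vector limit and the 64 vectors requested per port.

```c
#include <assert.h>
#include <stdio.h>

#define APIC_NUM_IOINTS  191  /* per-local-APIC limit quoted in the report */
#define VECTORS_PER_PORT 64   /* what ixgbe asks for on each port */

/*
 * Toy model of allocation before the APs are running: every vector lands
 * on the boot CPU. Returns how many vectors the port ends up with: the
 * full request if it fits, otherwise 1 (modelling the fallback to a
 * single MSI vector), or 0 if not even one vector is left.
 */
static int
alloc_port_on_bsp(int *bsp_used, int want)
{
	assert(want > 0);
	if (*bsp_used + want <= APIC_NUM_IOINTS) {
		*bsp_used += want;
		return (want);
	}
	if (*bsp_used + 1 <= APIC_NUM_IOINTS) {
		*bsp_used += 1;
		return (1);
	}
	return (0);
}

/*
 * Allocate for nports ports in enumeration order. got[] receives the
 * per-port result; the return value is the total used on the boot CPU.
 */
static int
simulate_boot(int nports, int got[])
{
	int bsp_used = 0;

	for (int i = 0; i < nports; i++) {
		got[i] = alloc_port_on_bsp(&bsp_used, VECTORS_PER_PORT);
		printf("port %d: %d vector(s)\n", i, got[i]);
	}
	return (bsp_used);
}
```

Calling simulate_boot(4, got) in this model gives the first two ports 64 vectors each and the last two ports one vector each, which matches the behaviour described above: the third request would need 192 vectors on one CPU, one more than the limit.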
Cc: jhb in case this is some code that he knows about.
Yes, this is a known issue, but the solutions I have in mind aren't entirely trivial. The "best" solution in my mind is to more fully flesh out the multi-pass boot-time probe work that I started, so that we can initialize things like timers early and then bring up the APs (and scheduler) before most drivers probe. This would allow drivers to properly spread their interrupts across all CPUs from the start rather than having them all start off on CPU 0. The hackish approach is to change individual drivers to defer setting up interrupts until after SI_SUB_SMP via a custom SYSINIT. That's not really great long term, of course, but it would work as a workaround on older systems.
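The deferral idea in the "hackish" approach can be illustrated with a userland mock of ordered boot stages. None of this is the kernel's SYSINIT machinery: the stage enum, the mock_* functions, and the figure of 8 CPUs are all hypothetical and exist only to show why running interrupt setup after the APs are released changes the outcome.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical boot stages in boot order; stand-ins for SYSINIT levels. */
enum stage { STAGE_DRIVERS = 1, STAGE_SMP = 2, STAGE_LATE = 3 };

struct init_entry {
	enum stage stage;
	void (*fn)(void);
};

static int ncpus_online;   /* CPUs able to take interrupts right now */
static int queues_set_up;  /* interrupts the mock driver ended up with */

/* Attach probes the device but deliberately sets up no interrupts yet. */
static void mock_attach(void) { }

/* Models releasing the APs; 8 CPUs is an arbitrary example value. */
static void mock_release_aps(void) { ncpus_online = 8; }

/* Deferred hook: one queue interrupt per CPU that is online by now. */
static void mock_setup_intr(void) { queues_set_up = ncpus_online; }

static int
cmp_stage(const void *a, const void *b)
{
	const struct init_entry *x = a, *y = b;

	return ((int)x->stage - (int)y->stage);
}

/* Run the hooks in stage order, however they were registered. */
static int
run_boot(void)
{
	struct init_entry tab[] = {
		{ STAGE_LATE, mock_setup_intr },
		{ STAGE_SMP, mock_release_aps },
		{ STAGE_DRIVERS, mock_attach },
	};
	size_t n = sizeof(tab) / sizeof(tab[0]);

	ncpus_online = 1;   /* only the boot CPU at first */
	queues_set_up = 0;
	qsort(tab, n, sizeof(tab[0]), cmp_stage);
	for (size_t i = 0; i < n; i++)
		tab[i].fn();
	return (queues_set_up);
}
```

In this mock, run_boot() returns 8 because the interrupt setup hook runs after the stage that brings the other CPUs online. Had the setup been done inside mock_attach(), it would have seen only the boot CPU.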
A commit references this bug:

Author: jhb
Date: Sat May 14 18:22:54 UTC 2016
New revision: 299746
URL: https://svnweb.freebsd.org/changeset/base/299746

Log:
  Add an EARLY_AP_STARTUP option to start APs earlier during boot.

  Currently, Application Processors (non-boot CPUs) are started by MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until SI_SUB_SMP at which point they are released to run kernel threads. SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter the scheduler and start running threads until fairly late in the boot.

  This change moves SI_SUB_SMP up to just before software interrupt threads are created allowing the APs to start executing kernel threads much sooner (before any devices are probed). This allows several initialization routines that need to perform initialization on all CPUs to now perform that initialization in one step rather than having to defer the AP initialization to a second SYSINIT run at SI_SUB_SMP. It also permits all CPUs to be available for handling interrupts before any devices are probed.

  This last feature fixes a problem with interrupt vector exhaustion. Specifically, in the old model all device interrupts were routed onto the boot CPU during boot. Later after the APs were released at SI_SUB_SMP, interrupts were redistributed across all CPUs. However, several drivers for multiqueue hardware allocate N interrupts per CPU in the system. In a system with many CPUs, just a few drivers doing this could exhaust the available pool of interrupt vectors on the boot CPU as each driver was allocating N * mp_ncpu vectors on the boot CPU. Now, drivers will allocate interrupts on their desired CPUs during boot meaning that only N interrupts are allocated from the boot CPU instead of N * mp_ncpu.

  Some other bits of code can also be simplified as smp_started is now true much earlier and will now always be true for these bits of code. This removes the need to treat the single-CPU boot environment as a special case.
  As a transition aid, the new behavior is available under a new kernel option (EARLY_AP_STARTUP). This will allow the option to be turned off if need be during initial testing. I plan to enable this on x86 by default in a followup commit in the next few days and to have all platforms moved over before 11.0. Once the transition is complete, the option will be removed along with the !EARLY_AP_STARTUP code.

  These changes have only been tested on x86. Other platform maintainers are encouraged to port their architectures over as well. The main things to check for are any uses of smp_started in MD code that can be simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in the EARLY_AP_STARTUP case (e.g. the interrupt shuffling).

  PR: kern/199321
  Reviewed by: markj, gnn, kib
  Sponsored by: Netflix

Changes:
  head/sys/cddl/dev/dtrace/amd64/dtrace_subr.c
  head/sys/cddl/dev/dtrace/dtrace_load.c
  head/sys/cddl/dev/dtrace/i386/dtrace_subr.c
  head/sys/cddl/dev/dtrace/powerpc/dtrace_subr.c
  head/sys/conf/NOTES
  head/sys/conf/options
  head/sys/dev/acpica/acpi.c
  head/sys/dev/acpica/acpi_cpu.c
  head/sys/dev/hwpmc/hwpmc_mod.c
  head/sys/dev/hyperv/vmbus/hv_vmbus_drv_freebsd.c
  head/sys/dev/xen/control/control.c
  head/sys/geom/eli/g_eli.c
  head/sys/kern/kern_clock.c
  head/sys/kern/kern_clocksource.c
  head/sys/kern/kern_cpu.c
  head/sys/net/netisr.c
  head/sys/sys/kernel.h
  head/sys/x86/isa/clock.c
  head/sys/x86/x86/intr_machdep.c
  head/sys/x86/x86/local_apic.c
  head/sys/x86/x86/mca.c
  head/sys/x86/x86/mp_x86.c
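To make the effect of the commit concrete, here is a rough userland sketch of round-robin vector placement. It is not the kernel's intr_next_cpu() logic: spread_round_robin() and MAX_CPUS are hypothetical names, and the 72-CPU figure is an assumption derived from the reporter's 2x 18-core SMT system.

```c
#include <assert.h>
#include <string.h>

#define MAX_CPUS 256  /* arbitrary bound for this sketch */

/*
 * Place nvectors interrupt vectors round-robin across ncpus CPUs.
 * per_cpu[] (MAX_CPUS entries) receives the per-CPU counts; the return
 * value is the largest count on any single CPU.
 */
static int
spread_round_robin(int nvectors, int ncpus, int per_cpu[MAX_CPUS])
{
	int next = 0, max = 0;

	assert(ncpus > 0 && ncpus <= MAX_CPUS);
	memset(per_cpu, 0, sizeof(int) * MAX_CPUS);
	for (int i = 0; i < nvectors; i++) {
		per_cpu[next]++;
		if (per_cpu[next] > max)
			max = per_cpu[next];
		next = (next + 1) % ncpus;
	}
	return (max);
}
```

With only the boot CPU available, spread_round_robin(256, 1, per_cpu) puts all 256 vectors (4 ports x 64) on one CPU, which is over the 191-vector limit from the original report. With 72 CPUs available at probe time, spread_round_robin(256, 72, per_cpu) leaves at most 4 vectors on any CPU.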
A commit references this bug:

Author: jhb
Date: Fri Dec 16 21:10:38 UTC 2016
New revision: 310177
URL: https://svnweb.freebsd.org/changeset/base/310177

Log:
  Enable EARLY_AP_STARTUP on amd64 and i386 kernels by default.

  PR: 199321, 203682
  MFC after: 2 months
  Sponsored by: Netflix

Changes:
  head/sys/amd64/conf/GENERIC
  head/sys/amd64/conf/MINIMAL
  head/sys/i386/conf/GENERIC
A commit references this bug:

Author: jhb
Date: Wed May 24 00:00:56 UTC 2017
New revision: 318763
URL: https://svnweb.freebsd.org/changeset/base/318763

Log:
  MFC 310177: Enable EARLY_AP_STARTUP on amd64 and i386 kernels by default.

  PR: 199321, 203682
  Discussed with: re (kib)
  Relnotes: yes

Changes:
  _U stable/11/
  stable/11/sys/amd64/conf/GENERIC
  stable/11/sys/amd64/conf/MINIMAL
  stable/11/sys/i386/conf/GENERIC
This change is too disruptive to merge back to 10.4.