Bug 199321 - Total MSI-X vector allocation limited to 191 vectors
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: John Baldwin
Reported: 2015-04-09 16:59 UTC by Jim Harris
Modified: 2017-06-06 19:00 UTC

Description Jim Harris freebsd_committer freebsd_triage 2015-04-09 16:59:08 UTC
System Configuration:
2x 18C E5-v3 CPU (SMT enabled)
2x X520 Intel Ethernet 10Gb Adapter (4 ports total)
11.0-CURRENT (r281283)

The ixgbe driver tries to allocate the full range of 64 MSI-X vectors per port.  The first two ports succeed, but the third and fourth ports fail and fall back to a single MSI vector.

msix_alloc() tries to use intr_next_cpu() to round-robin MSI-X vectors across all cores.  But driver initialization happens before APs are started, so intr_next_cpu() just returns the BSP's APIC ID 0.  Each local APIC is limited to APIC_NUM_IOINTS==191 vectors, so a single local APIC cannot handle the 64 vectors for each of the 4 ports (256 in total).
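
The arithmetic above can be checked with a small toy model. This is only a sketch, not the kernel's allocator: it assumes each port's request is all-or-nothing, ignores the fallback MSI vector and any vectors used by other devices, and the helper name allocate() is hypothetical. The constants 191, 64 and 4 come from the report; the 72 CPUs assume 2 sockets x 18 cores x 2 SMT threads.

```python
APIC_NUM_IOINTS = 191   # per-local-APIC limit quoted in the report
VECTORS_PER_PORT = 64   # what ixgbe requests per port, per the report
PORTS = 4               # 2x dual-port X520 adapters


def allocate(ports, vectors_per_port, cpus_available, limit=APIC_NUM_IOINTS):
    """Toy all-or-nothing allocator (hypothetical, for illustration only).

    Vectors are handed out round-robin across `cpus_available` CPUs. A port's
    request fails as a whole if any CPU it lands on has already reached `limit`.
    Returns (per-port success flags, vectors in use per CPU).
    """
    used = [0] * cpus_available
    next_cpu = 0
    results = []
    for _ in range(ports):
        trial = list(used)          # work on a copy so a failure leaves no trace
        cpu = next_cpu
        ok = True
        for _ in range(vectors_per_port):
            if trial[cpu] >= limit:
                ok = False
                break
            trial[cpu] += 1
            cpu = (cpu + 1) % cpus_available
        if ok:
            used = trial
            next_cpu = cpu
        results.append(ok)
    return results, used


# Only the BSP is available while drivers attach.
bsp_only, bsp_used = allocate(PORTS, VECTORS_PER_PORT, cpus_available=1)
print(bsp_only, bsp_used)           # [True, True, False, False] [128]

# All 72 hardware threads available (assumed 2 x 18 cores x 2 SMT).
all_cpus, all_used = allocate(PORTS, VECTORS_PER_PORT, cpus_available=72)
print(all_cpus, max(all_used))      # [True, True, True, True] 4
```

In the BSP-only case the third port needs vectors 129 through 192 on the same local APIC, one more than the 191 available, which matches the failure of the third and fourth ports described above.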

intr_shuffle_irqs() runs in the SI_SUB_SMP phase and is intended to shuffle IRQs off of core 0 to the APs, but by this point it is too late: pci_alloc_msix() has already reported to ixgbe(4) that it could not allocate all 64 vectors.

x86 defines NUM_MSI_INTS as 512, but this bug effectively restricts the number of MSI-X vectors to APIC_NUM_IOINTS (191).  The impact is that some drivers/devices which can take advantage of multiple queues and one interrupt per queue must fall back to a single interrupt.  There may also be inconsistency across platforms based on device enumeration: whichever devices enumerate first will get their allotment of MSI-X vectors.

This problem will be exacerbated as NVMe SSDs, which also try to allocate 32-64 MSI-X vectors each, become more prevalent.  Systems with 8 NVMe SSDs are already available.
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2015-04-15 17:35:58 UTC
Cc: jhb in case this is some code that he knows about.
Comment 2 John Baldwin freebsd_committer freebsd_triage 2015-04-15 17:41:10 UTC
Yes, this is a known issue, but the solutions I have in mind aren't entirely trivial.  The "best" solution in my mind is to flesh out more fully the multi-pass boot-time probe work that I started, so that we are able to initialize things like timers early and then bring up the APs (and scheduler) before most drivers probe.  This would allow drivers to properly spread their interrupts across all CPUs from the start rather than having them all start off on CPU 0.  The hackish approach is to change individual drivers to defer setting up interrupts until after SI_SUB_SMP via a custom SYSINIT.  That's not really great long term of course, but would work as a workaround on older systems.
Comment 3 commit-hook freebsd_committer freebsd_triage 2016-05-14 18:23:01 UTC
A commit references this bug:

Author: jhb
Date: Sat May 14 18:22:54 UTC 2016
New revision: 299746
URL: https://svnweb.freebsd.org/changeset/base/299746

Log:
  Add an EARLY_AP_STARTUP option to start APs earlier during boot.

  Currently, Application Processors (non-boot CPUs) are started by
  MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until
  SI_SUB_SMP at which point they are released to run kernel threads.
  SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter
  the scheduler and start running threads until fairly late in the
  boot.

  This change moves SI_SUB_SMP up to just before software interrupt
  threads are created allowing the APs to start executing kernel
  threads much sooner (before any devices are probed).  This allows
  several initialization routines that need to perform initialization
  on all CPUs to now perform that initialization in one step rather
  than having to defer the AP initialization to a second SYSINIT run
  at SI_SUB_SMP.  It also permits all CPUs to be available for
  handling interrupts before any devices are probed.

  This last feature fixes a problem with interrupt vector exhaustion.
  Specifically, in the old model all device interrupts were routed
  onto the boot CPU during boot.  Later after the APs were released at
  SI_SUB_SMP, interrupts were redistributed across all CPUs.

  However, several drivers for multiqueue hardware allocate N interrupts
  per CPU in the system.  In a system with many CPUs, just a few drivers
  doing this could exhaust the available pool of interrupt vectors on
  the boot CPU as each driver was allocating N * mp_ncpu vectors on the
  boot CPU.  Now, drivers will allocate interrupts on their desired CPUs
  during boot meaning that only N interrupts are allocated from the boot
  CPU instead of N * mp_ncpu.

  Some other bits of code can also be simplified as smp_started is
  now true much earlier and will now always be true for these bits of
  code.  This removes the need to treat the single-CPU boot environment
  as a special case.

  As a transition aid, the new behavior is available under a new kernel
  option (EARLY_AP_STARTUP).  This will allow the option to be turned off
  if need be during initial testing.  I plan to enable this on x86 by
  default in a followup commit in the next few days and to have all
  platforms moved over before 11.0.  Once the transition is complete,
  the option will be removed along with the !EARLY_AP_STARTUP code.

  These changes have only been tested on x86.  Other platform maintainers
  are encouraged to port their architectures over as well.  The main
  things to check for are any uses of smp_started in MD code that can be
  simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in
  the EARLY_AP_STARTUP case (e.g. the interrupt shuffling).

  PR:		kern/199321
  Reviewed by:	markj, gnn, kib
  Sponsored by:	Netflix

Changes:
  head/sys/cddl/dev/dtrace/amd64/dtrace_subr.c
  head/sys/cddl/dev/dtrace/dtrace_load.c
  head/sys/cddl/dev/dtrace/i386/dtrace_subr.c
  head/sys/cddl/dev/dtrace/powerpc/dtrace_subr.c
  head/sys/conf/NOTES
  head/sys/conf/options
  head/sys/dev/acpica/acpi.c
  head/sys/dev/acpica/acpi_cpu.c
  head/sys/dev/hwpmc/hwpmc_mod.c
  head/sys/dev/hyperv/vmbus/hv_vmbus_drv_freebsd.c
  head/sys/dev/xen/control/control.c
  head/sys/geom/eli/g_eli.c
  head/sys/kern/kern_clock.c
  head/sys/kern/kern_clocksource.c
  head/sys/kern/kern_cpu.c
  head/sys/net/netisr.c
  head/sys/sys/kernel.h
  head/sys/x86/isa/clock.c
  head/sys/x86/x86/intr_machdep.c
  head/sys/x86/x86/local_apic.c
  head/sys/x86/x86/mca.c
  head/sys/x86/x86/mp_x86.c
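
The ordering change described in the commit log above, and the deferral workaround from Comment 2, can be sketched with a toy model of boot steps. This is only an illustration in Python, not kernel code: the function boot(), the step names, and the order values are all hypothetical, and the order values are not the real SI_SUB_* constants. The model simply records how many CPUs are schedulable at the moment each driver step sets up its interrupts.

```python
def boot(ncpu, steps):
    """Run boot steps in ascending order (hypothetical model, not SYSINIT itself).

    `steps` is a list of (order, name) pairs. The step named "release_aps"
    makes all `ncpu` CPUs schedulable; every other step is treated as a driver
    setting up interrupts, and the number of CPUs it can see is recorded.
    """
    online = 1                      # only the boot CPU runs threads at first
    seen = {}
    for _, name in sorted(steps):
        if name == "release_aps":
            online = ncpu
        else:
            seen[name] = online
    return seen


# Illustrative order values only; they are not the real SI_SUB_* constants.
legacy = [(10, "driver_attach"), (90, "release_aps")]
deferred = [(10, "driver_attach"), (90, "release_aps"),
            (91, "driver_deferred_intr")]
early = [(5, "release_aps"), (10, "driver_attach")]

print(boot(72, legacy))     # {'driver_attach': 1}
print(boot(72, deferred))   # {'driver_attach': 1, 'driver_deferred_intr': 72}
print(boot(72, early))      # {'driver_attach': 72}
```

The "deferred" list corresponds to the per-driver workaround (interrupt setup moved to a step after the APs are released), while "early" corresponds to moving the AP release ahead of device probing so that no driver needs to change.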
Comment 4 commit-hook freebsd_committer freebsd_triage 2016-12-16 21:11:28 UTC
A commit references this bug:

Author: jhb
Date: Fri Dec 16 21:10:38 UTC 2016
New revision: 310177
URL: https://svnweb.freebsd.org/changeset/base/310177

Log:
  Enable EARLY_AP_STARTUP on amd64 and i386 kernels by default.

  PR:		199321, 203682
  MFC after:	2 months
  Sponsored by:	Netflix

Changes:
  head/sys/amd64/conf/GENERIC
  head/sys/amd64/conf/MINIMAL
  head/sys/i386/conf/GENERIC
Comment 5 commit-hook freebsd_committer freebsd_triage 2017-05-24 00:01:39 UTC
A commit references this bug:

Author: jhb
Date: Wed May 24 00:00:56 UTC 2017
New revision: 318763
URL: https://svnweb.freebsd.org/changeset/base/318763

Log:
  MFC 310177: Enable EARLY_AP_STARTUP on amd64 and i386 kernels by default.

  PR:		199321, 203682
  Discussed with:	re (kib)
  Relnotes:	yes

Changes:
_U  stable/11/
  stable/11/sys/amd64/conf/GENERIC
  stable/11/sys/amd64/conf/MINIMAL
  stable/11/sys/i386/conf/GENERIC
Comment 6 John Baldwin freebsd_committer freebsd_triage 2017-06-06 19:00:30 UTC
This change is too disruptive to merge back to 10.4.