We are having a problem with the ULE scheduler: it schedules a process onto a processor that is monopolized by a realtime thread, and the process then never runs and never gets "stolen" from that processor's queue to run on another. This happens readily in our application on FreeBSD. Since it would be impossible for a third party to run our software, I came up with a test suite that reproduces the issue, but it needs to run for a while before the lockup happens.

To reproduce the issue, run three shells. The first runs "many", which starts realtime threads on all but one available processor. The second runs "runOne", which starts a realtime thread on the last available processor for a period of time, then exits, repeating the process. The third runs "runIfconfig", which just calls ifconfig in a loop. Eventually ifconfig will get stuck in the "RUN" state. Changing "kern.sched.steal_thresh" to 1 has no effect.

The source for the programs and scripts follows the test machine's details.

Snapshot of top showing the locked process (ifconfig) in the test:

last pid: 89788; load averages: 8.02, 8.09, 9.71 up 0+19:15:59 14:40:09
49 processes: 9 running, 40 sleeping
CPU: 87.5% user, 0.0% nice, 0.0% system, 0.0% interrupt, 12.5% idle
Mem: 6784K Active, 50M Inact, 3087M Wired, 286M Buf, 28G Free
Swap: 4096M Total, 4096M Free

  PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
89782 root      20    0 19600K  3660K ttyin   7   0:00   0.00% csh
89779 root      20    0 83092K  8288K select  7   0:00   0.00% sshd
89725 root      52    0  6244K  1996K nanslp  7   0:00   0.00% sleep
89717 root      52    0 13144K  2832K wait    7   0:00   0.00% sh
89125 root      20    0 19600K  3704K ttyin   7   0:00   0.00% csh
53534 root      20    0  8392K  2220K piperd  7   0:00   0.00% mail
53533 root      52    0 13144K  2816K wait    7   0:00   0.00% sh
53526 root      52    0 13144K  2808K wait    7   0:00   0.00% sh
53525 root      21    0  6244K  2004K wait    7   0:00   0.00% lockf
53523 root      21    0 13144K  2808K wait    7   0:00   0.00% sh
53521 root      20    0 13144K  2812K wait    7   0:00   0.00% sh
51764 root      20    0  8392K  2216K piperd  7   0:00   0.00% mail
51763 root      20    0 13144K  2816K wait    7   0:00   0.00% sh
51754 root      52    0 13144K  2808K wait    7   0:00   0.00% sh
51752 root      21    0  6244K  2004K wait    7   0:00   0.00% lockf
51748 root      21    0 13144K  2808K wait    7   0:00   0.00% sh
51746 root      20    0 12564K  2564K piperd  7   0:00   0.00% cron
38859 root      72    0 19104K  2768K RUN     6   0:00   0.00% ifconfig
 6945 root      20    0 13460K  2452K nanslp  7   0:00   0.00% many{many}
 6945 root     -21 r31F 13460K  2452K CPU0    0  19.2H  99.06% many{many}
 6945 root     -21 r31F 13460K  2452K CPU1    1  19.2H  89.40% many{many}
 6945 root     -21 r31F 13460K  2452K CPU2    2  19.2H 100.00% many{many}
 6945 root     -21 r31F 13460K  2452K CPU3    3  19.2H  89.41% many{many}
 6945 root     -21 r31F 13460K  2452K CPU4    4  19.2H  89.40% many{many}
 6945 root     -21 r31F 13460K  2452K CPU5    5  19.2H  89.39% many{many}
 6945 root     -21 r31F 13460K  2452K CPU6    6  19.1H  99.31% many{many}
  887 root      52    0 16664K  3984K wait    7   2:36   0.00% ksh93
...

Details of the test system:

uname -a:
FreeBSD fbdev 11.1-RELEASE-p9 FreeBSD 11.1-RELEASE-p9 #0: Tue Apr 3 16:59:16 UTC 2018 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64

root@fbl:~ # pciconf -l
hostb0@pci0:0:0:0: class=0x060000 card=0x01588086 chip=0x01588086 rev=0x09 hdr=0x00
pcib1@pci0:0:1:0: class=0x060400 card=0x01518086 chip=0x01518086 rev=0x09 hdr=0x01
vgapci0@pci0:0:2:0: class=0x030000 card=0x21118086 chip=0x016a8086 rev=0x09 hdr=0x00
pcib2@pci0:0:6:0: class=0x060400 card=0x015d8086 chip=0x015d8086 rev=0x09 hdr=0x01
none0@pci0:0:22:0: class=0x078000 card=0x1c3a8086 chip=0x1c3a8086 rev=0x04 hdr=0x00
ehci0@pci0:0:26:0: class=0x0c0320 card=0x1c2d8086 chip=0x1c2d8086 rev=0x05 hdr=0x00
pcib3@pci0:0:28:0: class=0x060400 card=0x1c108086 chip=0x1c108086 rev=0xb5 hdr=0x01
ehci1@pci0:0:29:0: class=0x0c0320 card=0x1c268086 chip=0x1c268086 rev=0x05 hdr=0x00
pcib4@pci0:0:30:0: class=0x060401 card=0x244e8086 chip=0x244e8086 rev=0xa5 hdr=0x01
isab0@pci0:0:31:0: class=0x060100 card=0x1c568086 chip=0x1c568086 rev=0x05 hdr=0x00
ahci0@pci0:0:31:2: class=0x010601 card=0x1c028086 chip=0x1c028086 rev=0x05 hdr=0x00
none1@pci0:0:31:3: class=0x0c0500 card=0x1c228086 chip=0x1c228086 rev=0x05 hdr=0x00
ix0@pci0:1:0:0: class=0x020000 card=0xffffffff chip=0x10fb8086 rev=0x01 hdr=0x00
ix1@pci0:1:0:1: class=0x020000 card=0xffffffff chip=0x10fb8086 rev=0x01 hdr=0x00
igb0@pci0:2:0:0: class=0x020000 card=0x00008086 chip=0x150e8086 rev=0x01 hdr=0x00
igb1@pci0:2:0:1: class=0x020000 card=0x00008086 chip=0x150e8086 rev=0x01 hdr=0x00
igb2@pci0:3:0:0: class=0x020000 card=0x00008086 chip=0x150e8086 rev=0x01 hdr=0x00
igb3@pci0:3:0:1: class=0x020000 card=0x00008086 chip=0x150e8086 rev=0x01 hdr=0x00
none2@pci0:4:0:0: class=0x020000 card=0x012310ec chip=0x816710ec rev=0x10 hdr=0x00

dmesg -a:
Copyright (c) 1992-2017 The FreeBSD Project. Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved. FreeBSD is a registered trademark of The FreeBSD Foundation. FreeBSD 11.1-RELEASE-p9 #0: Tue Apr 3 16:59:16 UTC 2018 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64 FreeBSD clang version 4.0.0 (tags/RELEASE_400/final 297347) (based on LLVM 4.0.0) VT(vga): resolution 640x480 CPU: Intel(R) Xeon(R) CPU E3-1275 V2 @ 3.50GHz (3492.14-MHz K8-class CPU) Origin="GenuineIntel" Id=0x306a9 Family=0x6 Model=0x3a Stepping=9 Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE> Features2=0x7fbae3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND> AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM> AMD Features2=0x1<LAHF> Structured Extended Features=0x281<FSGSBASE,SMEP,ERMS> XSAVE Features=0x1<XSAVEOPT> VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID TSC: P-state invariant, performance statistics real memory = 34359738368 (32768 MB) avail memory = 33212313600 (31673 MB) Event timer "LAPIC" quality 600 ACPI APIC Table: <ALASKA A M I> FreeBSD/SMP:
Multiprocessor System Detected: 8 CPUs FreeBSD/SMP: 1 package(s) x 4 core(s) x 2 hardware threads random: unblocking device. ioapic0 <Version 2.0> irqs 0-23 on motherboard SMP: AP CPU #1 Launched! SMP: AP CPU #4 Launched! SMP: AP CPU #2 Launched! SMP: AP CPU #3 Launched! SMP: AP CPU #5 Launched! SMP: AP CPU #6 Launched! SMP: AP CPU #7 Launched! Timecounter "TSC-low" frequency 1746072482 Hz quality 1000 random: entropy device external interface kbd1 at kbdmux0 netmap: loaded module module_register_init: MOD_LOAD (vesa, 0xffffffff80f5eb40, 0) error 19 random: registering fast source Intel Secure Key RNG random: fast provider: "Intel Secure Key RNG" 0: virt=0xfffffe0859800000 phys=0x100000000 1: virt=0xfffffe0899800000 phys=0x140000000 nexus0 vtvga0: <VT VGA driver> on motherboard cryptosoft0: <software crypto> on motherboard acpi0: <ALASKA A M I> on motherboard acpi0: Power Button (fixed) cpu0: <ACPI CPU> on acpi0 cpu1: <ACPI CPU> on acpi0 cpu2: <ACPI CPU> on acpi0 cpu3: <ACPI CPU> on acpi0 cpu4: <ACPI CPU> on acpi0 cpu5: <ACPI CPU> on acpi0 cpu6: <ACPI CPU> on acpi0 cpu7: <ACPI CPU> on acpi0 hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0 Timecounter "HPET" frequency 14318180 Hz quality 950 Event timer "HPET" frequency 14318180 Hz quality 550 atrtc0: <AT realtime clock> port 0x70-0x77 irq 8 on acpi0 atrtc0: Warning: Couldn't map I/O. 
Event timer "RTC" frequency 32768 Hz quality 0 attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0 Timecounter "i8254" frequency 1193182 Hz quality 0 Event timer "i8254" frequency 1193182 Hz quality 100 Timecounter "ACPI-fast" frequency 3579545 Hz quality 900 acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0 pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0 pcib0: _OSC returned error 0x10 pci0: <ACPI PCI bus> on pcib0 pcib1: <ACPI PCI-PCI bridge> irq 16 at device 1.0 on pci0 pci1: <ACPI PCI bus> on pcib1 ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.1.13-k> port 0xe020-0xe03f mem 0xf7d20000-0xf7d3ffff,0xf7d44000-0xf7d47fff irq 16 at device 0.0 on pci1 ix0: Using MSIX interrupts with 9 vectors ix0: Ethernet address: 08:00:8a:00:10:e0 ix0: PCI Express Bus: Speed 5.0GT/s Width x8 ix0: netmap queues/slots: TX 8/2048, RX 8/2048 ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.1.13-k> port 0xe000-0xe01f mem 0xf7d00000-0xf7d1ffff,0xf7d40000-0xf7d43fff irq 17 at device 0.1 on pci1 ix1: Using MSIX interrupts with 9 vectors ix1: Ethernet address: 08:00:8a:00:10:e1 ix1: PCI Express Bus: Speed 5.0GT/s Width x8 ix1: netmap queues/slots: TX 8/2048, RX 8/2048 vgapci0: <VGA-compatible display> port 0xf000-0xf03f mem 0xf7400000-0xf77fffff,0xe0000000-0xefffffff irq 16 at device 2.0 on pci0 vgapci0: Boot video device pcib2: <ACPI PCI-PCI bridge> irq 19 at device 6.0 on pci0 pci2: <ACPI PCI bus> on pcib2 igb0: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xd020-0xd03f mem 0xf7a80000-0xf7afffff,0xf7b04000-0xf7b07fff irq 19 at device 0.0 on pci2 igb0: Using MSIX interrupts with 9 vectors igb0: Ethernet address: 08:00:8a:00:00:82 igb0: Bound queue 0 to cpu 0 igb0: Bound queue 1 to cpu 1 igb0: Bound queue 2 to cpu 2 igb0: Bound queue 3 to cpu 3 igb0: Bound queue 4 to cpu 4 igb0: Bound queue 5 to cpu 5 igb0: Bound queue 6 to cpu 6 igb0: Bound queue 7 to cpu 7 igb0: netmap queues/slots: TX 8/1024, RX 
8/1024 igb1: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xd000-0xd01f mem 0xf7a00000-0xf7a7ffff,0xf7b00000-0xf7b03fff irq 16 at device 0.1 on pci2 igb1: Using MSIX interrupts with 9 vectors igb1: Ethernet address: 08:00:8a:00:00:83 igb1: Bound queue 0 to cpu 0 igb1: Bound queue 1 to cpu 1 igb1: Bound queue 2 to cpu 2 igb1: Bound queue 3 to cpu 3 igb1: Bound queue 4 to cpu 4 igb1: Bound queue 5 to cpu 5 igb1: Bound queue 6 to cpu 6 igb1: Bound queue 7 to cpu 7 igb1: netmap queues/slots: TX 8/1024, RX 8/1024 pci0: <simple comms> at device 22.0 (no driver attached) ehci0: <Intel Cougar Point USB 2.0 controller> mem 0xf7e04000-0xf7e043ff irq 16 at device 26.0 on pci0 usbus0: EHCI version 1.0 usbus0 on ehci0 usbus0: 480Mbps High Speed USB v2.0 pcib3: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0 pci3: <ACPI PCI bus> on pcib3 igb2: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xc020-0xc03f mem 0xf7880000-0xf78fffff,0xf7904000-0xf7907fff irq 16 at device 0.0 on pci3 igb2: Using MSIX interrupts with 9 vectors igb2: Ethernet address: 08:00:8a:00:00:80 igb2: Bound queue 0 to cpu 0 igb2: Bound queue 1 to cpu 1 igb2: Bound queue 2 to cpu 2 igb2: Bound queue 3 to cpu 3 igb2: Bound queue 4 to cpu 4 igb2: Bound queue 5 to cpu 5 igb2: Bound queue 6 to cpu 6 igb2: Bound queue 7 to cpu 7 igb2: netmap queues/slots: TX 8/1024, RX 8/1024 igb3: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xc000-0xc01f mem 0xf7800000-0xf787ffff,0xf7900000-0xf7903fff irq 17 at device 0.1 on pci3 igb3: Using MSIX interrupts with 9 vectors igb3: Ethernet address: 08:00:8a:00:00:81 igb3: Bound queue 0 to cpu 0 igb3: Bound queue 1 to cpu 1 igb3: Bound queue 2 to cpu 2 igb3: Bound queue 3 to cpu 3 igb3: Bound queue 4 to cpu 4 igb3: Bound queue 5 to cpu 5 igb3: Bound queue 6 to cpu 6 igb3: Bound queue 7 to cpu 7 igb3: netmap queues/slots: TX 8/1024, RX 8/1024 ehci1: <Intel Cougar Point USB 2.0 controller> mem 0xf7e03000-0xf7e033ff irq 23 at device 
29.0 on pci0 usbus1: EHCI version 1.0 usbus1 on ehci1 usbus1: 480Mbps High Speed USB v2.0 pcib4: <ACPI PCI-PCI bridge> at device 30.0 on pci0 pci4: <ACPI PCI bus> on pcib4 re0: <RealTek 8169SC/8110SC Single-chip Gigabit Ethernet> port 0xb000-0xb0ff mem 0xf7c20000-0xf7c200ff irq 16 at device 0.0 on pci4 re0: Chip rev. 0x18000000 re0: MAC rev. 0x00000000 miibus0: <MII bus> on re0 rgephy0: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 1 on miibus0 rgephy0: none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow re0: Using defaults for TSO: 65518/35/2048 re0: Ethernet address: 00:90:0b:29:07:b2 re0: netmap queues/slots: TX 1/256, RX 1/256 isab0: <PCI-ISA bridge> at device 31.0 on pci0 isa0: <ISA bus> on isab0 ahci0: <Intel Cougar Point AHCI SATA controller> port 0xf0b0-0xf0b7,0xf0a0-0xf0a3,0xf090-0xf097,0xf080-0xf083,0xf060-0xf07f mem 0xf7e02000-0xf7e027ff irq 19 at device 31.2 on pci0 ahci0: AHCI v1.30 with 6 6Gbps ports, Port Multiplier not supported ahcich2: <AHCI channel> at channel 2 on ahci0 ahciem0: <AHCI enclosure management bridge> on ahci0 acpi_button0: <Power Button> on acpi0 acpi_tz0: <Thermal Zone> on acpi0 acpi_tz1: <Thermal Zone> on acpi0 ppc1: <Parallel port> port 0x378-0x37f irq 5 on acpi0 ppc1: Generic chipset (NIBBLE-only) in COMPATIBLE mode ppbus0: <Parallel port bus> on ppc1 lpt0: <Printer> on ppbus0 lpt0: Interrupt-driven port ppi0: <Parallel I/O> on ppbus0 uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0 atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0 atkbd0: <AT Keyboard> irq 1 on atkbdc0 kbd0 at atkbd0 atkbd0: [GIANT-LOCKED] ppc0: cannot reserve I/O port range est0: <Enhanced SpeedStep Frequency Control> on cpu0 est1: <Enhanced SpeedStep Frequency Control> on cpu1 est2: <Enhanced SpeedStep Frequency Control> on cpu2 est3: <Enhanced 
SpeedStep Frequency Control> on cpu3 est4: <Enhanced SpeedStep Frequency Control> on cpu4 est5: <Enhanced SpeedStep Frequency Control> on cpu5 est6: <Enhanced SpeedStep Frequency Control> on cpu6 est7: <Enhanced SpeedStep Frequency Control> on cpu7 Timecounters tick every 1.000 msec nvme cam probe device init ugen1.1: <Intel EHCI root HUB> at usbus1 ugen0.1: <Intel EHCI root HUB> at usbus0 uhub0: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1 uhub1: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus0 ses0 at ahciem0 bus 0 scbus1 target 0 lun 0 ses0: <AHCI SGPIO Enclosure 1.00 0001> SEMB S-E-S 2.00 device ses0: SEMB SES Device ada0 at ahcich2 bus 0 scbus0 target 0 lun 0 ada0: <TS16GCF160 20111219> ATA SATA 1.x device ada0: Serial Number A2323035180354000072 ada0: 150.000MB/s transfers (SATA 1.x, UDMA2, PIO 512bytes) ada0: 15280MB (31293440 512 byte sectors) Trying to mount root from ufs:/dev/ada0p3 [rw]... Setting hostuuid: c84483e3-9171-11e6-a133-08008a0010e0. Setting hostid: 0x224a41c7. Starting file system checks: /dev/ada0p3: FILE SYSTEM CLEAN; SKIPPING CHECKS /dev/ada0p3: clean, 519318 free (6518 frags, 64100 blocks, 0.2% fragmentation) uhub0: 2 ports with 2 removable, self powered uhub1: 2 ports with 2 removable, self powered Mounting local filesystems:. ELF ldconfig path: /lib /usr/lib /usr/lib/compat /usr/local/lib /usr/local/lib/gcc49 /usr/local/lib/perl5/5.20/mach/CORE /etc/ld-elf.so.conf 32-bit compatibility ldconfig path: /usr/lib32 Setting hostname: fbl. Setting up harvesting: [UMA],[FS_ATIME],SWI,INTERRUPT,NET_NG,NET_ETHER,NET_TUN,MOUSE,KEYBOARD,ATTACH,CACHED Feeding entropy: ugen0.2: <vendor 0x8087 product 0x0024> at usbus0 uhub2 on uhub1 uhub2: <vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2> on usbus0 ugen1.2: <vendor 0x8087 product 0x0024> at usbus1 uhub3 on uhub0 uhub3: <vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2> on usbus1 . 
uhub2: 6 ports with 6 removable, self powered uhub3: 8 ports with 8 removable, self powered ugen1.3: <vendor 0x0d3d USBPS2> at usbus1 ukbd0 on uhub3 ukbd0: <EP1> on usbus1 kbd2 at ukbd0 igb2: link state changed to UP Starting Network: lo0 ix0 ix1 igb0 igb1 igb2 igb3 re0. lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384 options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6> inet6 ::1 prefixlen 128 inet6 fe80::1%lo0 prefixlen 64 scopeid 0x8 inet 127.0.0.1 netmask 0xff000000 nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL> groups: lo ix0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500 options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6> ether 08:00:8a:00:10:e0 hwaddr 08:00:8a:00:10:e0 nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL> media: Ethernet autoselect status: no carrier ix1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500 options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6> ether 08:00:8a:00:10:e1 hwaddr 08:00:8a:00:10:e1 nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL> media: Ethernet autoselect status: no carrier igb0: flags=8c02<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500 options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6> ether 08:00:8a:00:00:82 hwaddr 08:00:8a:00:00:82 nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL> media: Ethernet autoselect status: no carrier igb1: flags=8c02<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500 options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6> ether 08:00:8a:00:00:83 hwaddr 08:00:8a:00:00:83 nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL> media: Ethernet autoselect status: no carrier igb2: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500 
options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6> ether 08:00:8a:00:00:80 hwaddr 08:00:8a:00:00:80 inet 206.210.221.193 netmask 0xffffff00 broadcast 206.210.221.255 nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL> media: Ethernet autoselect (100baseTX <full-duplex>) status: active igb3: flags=8c02<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500 options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6> ether 08:00:8a:00:00:81 hwaddr 08:00:8a:00:00:81 nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL> media: Ethernet autoselect status: no carrier re0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500 options=8209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC,LINKSTATE> ether 00:90:0b:29:07:b2 hwaddr 00:90:0b:29:07:b2 nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL> re0: link state changed to DOWN media: Ethernet autoselect (10baseT/UTP <half-duplex>) status: no carrier Starting devd. Starting Network: ix0. ix0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500 options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6> ether 08:00:8a:00:10:e0 hwaddr 08:00:8a:00:10:e0 nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL> media: Ethernet autoselect status: no carrier Starting Network: ix1. ix1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500 options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6> ether 08:00:8a:00:10:e1 hwaddr 08:00:8a:00:10:e1 nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL> media: Ethernet autoselect status: no carrier Starting Network: igb0. 
igb0: flags=8c02<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500 options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6> ether 08:00:8a:00:00:82 hwaddr 08:00:8a:00:00:82 nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL> media: Ethernet autoselect status: no carrier Starting Network: igb1. igb1: flags=8c02<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500 options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6> ether 08:00:8a:00:00:83 hwaddr 08:00:8a:00:00:83 nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL> media: Ethernet autoselect status: no carrier Starting Network: igb3. igb3: flags=8c02<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500 options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6> ether 08:00:8a:00:00:81 hwaddr 08:00:8a:00:00:81 nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL> media: Ethernet autoselect status: no carrier Starting Network: re0. re0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500 options=8209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC,LINKSTATE> ether 00:90:0b:29:07:b2 hwaddr 00:90:0b:29:07:b2 nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL> media: Ethernet autoselect (10baseT/UTP <half-duplex>) status: no carrier add host 127.0.0.1: gateway lo0 fib 0: route already in table add net default: gateway 206.210.221.1 add host ::1: gateway lo0 fib 0: route already in table add net fe80::: gateway ::1 add net ff02::: gateway ::1 add net ::ffff:0.0.0.0: gateway ::1 add net ::0.0.0.0: gateway ::1 Generating host.conf. Mounting NFS filesystems:. ELF ldconfig path: /lib /usr/lib /usr/lib/compat /usr/local/lib /usr/local/lib/gcc49 /usr/local/lib/perl5/5.20/mach/CORE /etc/ld-elf.so.conf 32-bit compatibility ldconfig path: /usr/lib32 Creating and/or trimming log files. Starting syslogd. 
Sources:

many.c:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <pthread.h>
#include <pthread_np.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Next CPU to bind to; each new thread takes the current value. */
static int cpu = 0;

void *run (void *data)
{
    cpuset_t cpuset;
    struct sched_param scheduleParam;

    /* Bind this thread to a single CPU. */
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);
    cpu++;
    int err = pthread_setaffinity_np(pthread_self(), sizeof(cpuset_t), &cpuset);
    printf("pthread_setaffinity_np: %d\n", err);

    /* Switch to SCHED_FIFO at the lowest realtime priority, then spin forever. */
    sched_getparam(0, &scheduleParam);
    scheduleParam.sched_priority = sched_get_priority_min(SCHED_FIFO);
    sched_setscheduler(0, SCHED_FIFO, &scheduleParam);
    while (1)
        ;
}

int main (void)
{
    pthread_t tp[32];
    uint32_t nCpu = 0;
    size_t cpuLen = sizeof nCpu;

    if (sysctlbyname("hw.ncpu", &nCpu, &cpuLen, NULL, 0) >= 0 && cpuLen == sizeof nCpu) {
        printf("%d CPUs available, starting realtime processes on %d of them\n", nCpu, nCpu - 1);
        /* One spinning realtime thread on every CPU except the last. */
        for (int ix = 0; ix < nCpu - 1; ++ix) {
            pthread_create(&tp[ix], NULL, run, NULL);
        }
        while (1)
            sleep(1);
    }
    return (0);
}

one.c:

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/sysctl.h>
#include <net/if.h>
#include <net/if_mib.h>
#include <pthread.h>
#include <pthread_np.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Fetch the generic interface MIB data for the named interface via sysctl. */
int getIfData (struct ifmibdata *mib, char *name)
{
    size_t len;
    int idx, pkt[6];

    idx = if_nametoindex(name);
    if (idx == 0) {
        return -1;
    }
    memset(mib, 0, sizeof(struct ifmibdata));
    pkt[0] = CTL_NET;
    pkt[1] = PF_LINK;
    pkt[2] = NETLINK_GENERIC;
    pkt[3] = IFMIB_IFDATA;
    pkt[4] = idx;
    pkt[5] = IFDATA_GENERAL;
    len = sizeof(struct ifmibdata);
    if (sysctl(pkt, 6, mib, &len, NULL, 0) == -1) {
        return -1;
    }
    return idx;
}

uint64_t nanoseconds (struct timespec* ts)
{
    return ts->tv_sec * (uint64_t)1000000000L + ts->tv_nsec;
}

int main (int argc, char *argv[])
{
    if (argc < 2) {
        printf("Please supply number of seconds to run\n");
        exit(1);
    }
    double secs = atof(argv[1]);
    uint32_t nCpu = 0;
    size_t cpuLen = sizeof nCpu;

    if (sysctlbyname("hw.ncpu", &nCpu, &cpuLen, NULL, 0) >= 0 && cpuLen == sizeof nCpu) {
        cpuset_t cpuset;
        struct sched_param scheduleParam;

        /* Bind to the last CPU, the one "many" leaves free. */
        CPU_ZERO(&cpuset);
        CPU_SET(nCpu - 1, &cpuset);
        pthread_setaffinity_np(pthread_self(), sizeof(cpuset_t), &cpuset);
        sched_getparam(0, &scheduleParam);
        scheduleParam.sched_priority = sched_get_priority_min(SCHED_FIFO);
        sched_setscheduler(0, SCHED_FIFO, &scheduleParam);
        printf("Running for %s second(s) on cpu %d\n", argv[1], nCpu - 1);

        struct timespec tsNow, tsEnd;
        tsEnd.tv_sec = (int)secs;
        tsEnd.tv_nsec = (double)(secs - tsEnd.tv_sec) * (uint64_t)1000000000L;
        clock_gettime(CLOCK_REALTIME, &tsNow);
        uint64_t start = nanoseconds(&tsNow);
        uint64_t stop = start + nanoseconds(&tsEnd);

        /* Issue sysctl calls back to back until the requested time has elapsed. */
        while (clock_gettime(CLOCK_REALTIME, &tsNow) == 0 && nanoseconds(&tsNow) < stop) {
            struct ifmibdata mib;
            getIfData(&mib, "lo0");
        }
    }
    return (0);
}

compile:

cc one.c -lthr -o one
cc many.c -lthr -o many

runOne:

#!/bin/ksh
while true; do ./one 1.1; done;

runIfconfig:

#!/bin/ksh
while true; do ifconfig; done;
Don, I've CC'ed you as you have previously had some interest in ULE. Feel free to un-CC if you're not interested :-).
Thread stealing is only done in the CPU idle loop (or more recently, when ULE detects that the CPU is about to go idle). Since none of the CPUs is idle in this case, stealing will not happen. Setting kern.sched.steal_thresh=1 won't make a difference since the stuck thread raises the load to 2 on the CPU where it is assigned.

In theory sched_balance() should eventually move the stuck thread to the CPU that is running "runOne", but it runs infrequently, and if the destination CPU is chosen totally randomly, that could take a while. There could be an issue in how it chooses the destination CPU that prevents it from ever choosing the CPU where "runOne" is running ...
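To make the reasoning above concrete, here is a small userspace sketch in C. It is only a toy model of the argument in this comment, not kernel code: `struct cpu_model`, its `transferable` field, and `pick_steal_victim()` are hypothetical names invented for illustration, and the victim search is far simpler than whatever ULE really does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified per-CPU state; this is NOT the kernel's struct tdq. */
struct cpu_model {
    int  load;          /* runnable threads, including the one on the CPU */
    int  transferable;  /* queued threads that could be moved elsewhere */
    bool idle;          /* true only when nothing is runnable here */
};

/*
 * Model of "stealing only happens from the idle path": a CPU that is not
 * idle never looks for work.  An idle CPU picks the first other CPU whose
 * load is at least steal_thresh and that has something transferable.
 * Returns the victim's index, or -1 if no steal is attempted or possible.
 */
static int
pick_steal_victim(const struct cpu_model *cpus, int ncpu, int self,
    int steal_thresh)
{
    assert(self >= 0 && self < ncpu);
    if (!cpus[self].idle)
        return -1;
    for (int i = 0; i < ncpu; i++) {
        if (i == self)
            continue;
        if (cpus[i].load >= steal_thresh && cpus[i].transferable > 0)
            return i;
    }
    return -1;
}
```

In this model, with every CPU busy the function returns -1 whatever the threshold is, and once the last CPU goes idle a threshold of 1 or 2 selects the same victim, because the CPU holding the stuck thread already has a load of 2.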
Don- My problem is that once it is "stuck" I can stop the runOne script and free up the remaining CPU and the ifconfig process will remain on the busy CPU. The machine is sitting 12.5% idle but the process never migrates. The "top" snapshot I included shows this.
Created attachment 193209
fix sched_balance

(In reply to Don Lewis from comment #2)
Hmm, I think sched_balance() is actually busted: the initial call is performed while smp_started == 0, so balance_ticks is never set to a non-zero value. The attached patch seems to address that at least.
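The failure described here can be illustrated with a small userspace model in C. This is a sketch under assumptions, not the kernel source and not the attached patch: `struct balancer`, `model_sched_balance()`, `model_sched_clock()`, `simulate()` and the `rearm_first` switch are hypothetical, the interval is fixed rather than randomized, and "arm the countdown before the early return" is just one possible remedy used to show the contrast.

```c
#include <stdbool.h>

/* Hypothetical model state; the field names only echo the comment above. */
struct balancer {
    bool smp_started;
    bool rearm_first;       /* assumed remedy: arm the countdown before the check */
    int  balance_interval;
    int  balance_ticks;     /* countdown to the next pass; 0 means "not armed" */
    int  runs;              /* number of balance passes that actually ran */
};

static void
model_sched_balance(struct balancer *b)
{
    if (b->rearm_first)
        b->balance_ticks = b->balance_interval;
    if (!b->smp_started)
        return;             /* early return: without rearm_first nothing is armed */
    if (!b->rearm_first)
        b->balance_ticks = b->balance_interval;
    b->runs++;
}

/* Per-tick hook: only an armed (non-zero) countdown can trigger a pass. */
static void
model_sched_clock(struct balancer *b)
{
    if (b->balance_ticks != 0 && --b->balance_ticks == 0)
        model_sched_balance(b);
}

/* Make the initial call before SMP is up, then run the clock for some ticks. */
static int
simulate(bool rearm_first, int ticks)
{
    struct balancer b = {
        .smp_started = false,
        .rearm_first = rearm_first,
        .balance_interval = 4,
    };

    model_sched_balance(&b);    /* initial call while smp_started is false */
    b.smp_started = true;
    for (int i = 0; i < ticks; i++)
        model_sched_clock(&b);
    return b.runs;
}
```

In the model, `simulate(false, 100)` never runs a balance pass because the countdown stays at zero forever, while `simulate(true, 100)` runs one pass every four ticks.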
(In reply to Mark Johnston from comment #4) Looks reasonable to me.
(In reply to Mark Johnston from comment #4)
So is this caused by the bug in this PR? https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=223914
(In reply to Ivan Klymenko from comment #6) Could be, I'm hoping that the submitter tries applying the patch to see if the problem persists. I didn't realize a PR already existed for sched_balance(). I'll work on getting it fixed.
Sorry for the delay! I tried the patch out and sched_balance_group now runs, but I can still get a process stuck in the RUN state on a busy CPU. FWIW, it doesn't seem like that mechanism ever moves processes. All my calls to tdq_move come from the tdq_idled loop. I'm debugging why the process doesn't get moved now. The weird thing is that other processes are successfully getting moved from the busy CPUs by the tdq_idled loop, just not the stuck process.

Don't know if this helps, but it seems like if I start another process, something triggers and it will move the stuck process, and everything will run fine for a while. Any clues from that?
(In reply to Mit Matelske from comment #3) Hmm, the idle CPU should then be able to steal the stuck process. Maybe it is stuck in some state where migration is forbidden ...
Looks like td_pinned gets set so the thread can't move from the busy processor. Not sure why that occasionally happens?
(In reply to Mit Matelske from comment #10) The kernel stack of the thread will provide a clue, if you're able to obtain it. procstat -kk <pid> can grab the kernel stack of a running thread.
# procstat -kk 1557
  PID    TID COMM     TDNAME KSTACK
 1557 100825 ifconfig -      mi_switch+0xe5 critical_exit+0x7a ipi_bitmap_handler+0x79 ipi_intr_bitmap_handler_u+0x9c userland_sysctl+0x157 sys___sysctl+0x5f amd64_syscall+0x6c4 fast_syscall_common+0x106
I'm not able to reproduce this on HEAD. I'll try setting up an 11.1 VM for testing.
I'm a bit confused by the repro. I presume the expected behaviour is that the ifconfig thread should be migrated to the CPU running "one". But in your sample "top" output, the last CPU is idle even though you're running "one" in a loop there - how can that be?
In the top output, I had stopped the "one" program to show that even if the processor goes idle the ifconfig never gets migrated to 7. Sorry for the confusion.
Is this reproducible on 12.2 or later?
I think td_pinned is the problem. The thread can't move until that flag is cleared, and that can't happen until it is able to get some CPU cycles to get through the section of code that td_pinned is protecting. I can see how this could happen in the general case, but in this particular case how does the ifconfig thread manage to make enough progress to set td_pinned while competing with the hard realtime thread that is hogging the cpu?
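The catch-22 described above can be sketched as a tiny userspace model in C. It is an illustration only, not kernel code: `struct thread_model` and the `model_*` functions are hypothetical names, and the pin count merely mimics the role the comments attribute to td_pinned (a non-zero value forbids migration).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a thread; only the fields the argument needs. */
struct thread_model {
    int  pinned;    /* nesting count; non-zero means "must stay on this CPU" */
    int  cpu;       /* CPU whose run queue currently holds the thread */
    bool runnable;
};

static void
model_pin(struct thread_model *td)
{
    td->pinned++;
}

/* In the scenario above, only the thread itself can do this, and only if it runs. */
static void
model_unpin(struct thread_model *td)
{
    assert(td->pinned > 0);
    td->pinned--;
}

static bool
model_can_migrate(const struct thread_model *td)
{
    return td->pinned == 0;
}

/* An idle CPU tries to take the thread; returns true if the thread moved. */
static bool
model_try_steal(struct thread_model *td, int idle_cpu)
{
    if (!td->runnable || td->cpu == idle_cpu || !model_can_migrate(td))
        return false;
    td->cpu = idle_cpu;
    return true;
}
```

In this model a thread that is preempted while pinned on CPU 6 is refused by `model_try_steal(&td, 7)` every time; the steal succeeds only after `model_unpin()`, which in the real scenario requires the thread to get CPU time on the very processor the realtime thread is monopolizing.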